Apache MADlib Comes of Age

October 6, 2017 Frank McQuillan

MADlib has graduated to a Top Level Project in the Apache Software Foundation (ASF), signifying that the community has been well-governed under the ASF's meritocratic process and principles. For Pivotal, this means accelerated innovation in the area of in-database machine learning and advanced analytics for Greenplum Database.

In this post, we describe the journey of MADlib from its roots as an open source project to the ASF, and its use by data scientists to solve real-world problems across a wide variety of industries.

What is MADlib?

MADlib is an open source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, graph and machine learning methods for structured and unstructured data. It uses shared-nothing, distributed, scale-out architectures to offer data scientists an effective toolset for challenging problems involving very large data sets. MADlib is SQL-based and supports Pivotal Greenplum Database and PostgreSQL.

Why was MADlib Developed?

MADlib was originally developed to support a departure from traditional enterprise data warehouses and business intelligence solutions. These had been successful for enterprise reporting and descriptive analytics needs, but were poorly suited for advanced predictive analytics use cases. These new analytics required fast access to massive data sets with highly iterative, parallelizable algorithms. Traditional EDW and BI solutions lacked the necessary performance capabilities, and implementing advanced algorithms required convoluted SQL that was difficult to construct and maintain.

By contrast, MADlib provides machine learning, graph, data utility and other advanced analytics capabilities that let data scientists, data engineers and others work in an integrated manner within a single platform, reducing friction in data science workflows. When MADlib is paired with an MPP analytic data warehouse like Greenplum, data scientists can develop many models in parallel. This is helpful for many types of use cases, such as modeling large populations at the entity level (e.g., individual customer tendencies). MADlib also lets users invoke advanced algorithms via SQL, rather than requiring SQL analysts to write those algorithms themselves. The result is more business value derived from the enterprise's data, and therefore better products and services for its customers.
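To make "invoking an algorithm via SQL" concrete, here is a minimal Python sketch that only assembles the SQL text for a call to MADlib's linear regression training function, madlib.linregr_train. The helper name build_linregr_train_sql and all table and column names (customer_purchases, purchase_models, spend, visits, tenure_months, customer_segment) are hypothetical and chosen for illustration. The argument order (source table, output table, dependent variable, independent variables, optional grouping columns) follows the MADlib documentation as we understand it, so check it against the version you have installed. Passing grouping columns is one way to get a separate model per group from a single call.

```python
def build_linregr_train_sql(source_table, out_table, dependent_var,
                            independent_vars, grouping_cols=None):
    """Return the SQL text for a madlib.linregr_train call.

    Hypothetical helper for illustration: it builds a string and does
    not connect to any database.
    """
    def quote(value):
        # MADlib takes table and column expressions as SQL string literals.
        return "'" + value.replace("'", "''") + "'"

    # The leading 1 adds an intercept term to the feature array.
    features = "ARRAY[1, " + ", ".join(independent_vars) + "]"
    args = [quote(source_table), quote(out_table),
            quote(dependent_var), quote(features)]
    if grouping_cols:
        # With grouping columns, one model is fitted per distinct group.
        args.append(quote(", ".join(grouping_cols)))
    return "SELECT madlib.linregr_train(" + ", ".join(args) + ");"


# Assumed example schema: one regression model per customer segment.
sql = build_linregr_train_sql(
    "customer_purchases", "purchase_models", "spend",
    ["visits", "tenure_months"], ["customer_segment"],
)
print(sql)
```

The resulting statement could then be run from psql or any PostgreSQL or Greenplum client on a database where MADlib is installed. The analyst supplies only table and column names, and the algorithm itself runs inside the database.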

Origins of MADlib

"MADlib was conceived from the outset as an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on scalable, in-database machine learning and statistics," said Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the original developers of MADlib.

The early discussions behind the project were written up in a paper at VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and computer scientists at Pivotal (formerly EMC/Greenplum).

The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the University of Wisconsin, and the University of Florida. The project was publicly documented in a paper at VLDB 2012.

Journey to the ASF

In September 2015, MADlib joined the ASF community as an incubating project. At the time, the open source community behind MADlib felt that aligning with the ASF community, governance model, and infrastructure would allow the project to accelerate adoption and community growth. There were five releases of MADlib as an incubating project, along with a growing number of industry and academic contributors and users.

In July 2017, MADlib graduated to a Top Level Project at the ASF, followed shortly by its first release as a Top Level Project, MADlib 1.12, in August 2017. This latest release includes new graph analytics (all pairs shortest path, weakly connected components, breadth first search, multiple graph measures), new sampling algorithms (stratified sampling, train-test split) and a multilayer perceptron, which is a type of artificial neural network. Read more about the 1.12 release here.

Enterprise Deployments

The recent announcement of Greenplum 5 reinforced the value proposition of a single platform that can perform compute-intensive and complex analytical workloads at scale. In the past, many enterprises deployed separate platforms in an attempt to gain insight from data using different techniques. For example, in addition to running SQL workloads on an Enterprise Data Warehouse (EDW) for business intelligence, they might deploy and manage separate databases for graph, geospatial, text and machine learning workloads.

Greenplum 5 is designed to eliminate data silos by integrating traditional and advanced analytics in a single scale-out analytics platform.

In concert with Greenplum’s MPP architecture, MADlib’s wide range of statistical and machine learning methods can cover a variety of real-world use cases, including:

Customer experience (see “Data Science Reveals Extraordinary Insights Into Drivers and Their Behavior”)
Information security (“A Data Science Approach to Detecting Insider Security Threats”)
Churn prediction (“Churn Prediction in Retail Finance and Asset Management”)

"At Pivotal, we have seen our customers successfully deploy MADlib on large scale data science projects across a wide variety of industry verticals," said Elisabeth Hendrickson, Vice President, R&D for Data at Pivotal. "As MADlib graduates to a Top-Level Project at the ASF, we anticipate increased adoption in the enterprise given the mature level of the codebase and the active developer community."

Continued Innovation

At Pivotal, we look forward to working with all future contributors in the MADlib community to advance the state of the art in scale-out data science tools.

There are many potential avenues for future development, including expanding the library of graph analytics algorithms, adding new machine learning capabilities and supporting evolving deep learning frameworks. If you have an idea, you are welcome to contribute to the open source project.

"It has been great to witness the growth of the MADlib community and codebase as an ASF incubating project, and I look forward to this continuing as a Top Level Project," added Hellerstein.

Learning More:

Read the ASF press release on MADlib graduation
Greenplum 5 announcement
MADlib home: madlib.apache.org

About the Author

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.
