MADlib’s Journey To Apache: Math, Stats & Machine Learning Methods

September 30, 2015 Frank McQuillan

MADlib is now an Apache Software Foundation Incubator project!

Together with Apache HAWQ (incubating), the MADlib open source project has transitioned its development and governance models to be in accordance with “The Apache Way.” The Apache Software Foundation (ASF) is a widely recognized place for like-minded developers to collaborate on software in open and productive ways. At Pivotal, we view it as the ideal venue to continue developing MADlib technology in innovative directions.

“From the beginning, MADlib has been an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on machine learning and statistics in a scalable, in-database context. The Apache Software Foundation community is the ideal forum to grow that constituency and codebase.”

-Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the originators of MADlib

In this post, we explain what MADlib is, provide a short history, describe why it is moving to the ASF, outline its value in the enterprise, and illustrate the current community membership.

What is MADlib?

MADlib is an open source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. It uses shared-nothing, distributed, scale-out architectures to offer data scientists an effective toolset for challenging problems involving very large data sets. MADlib is SQL-based and currently supports Apache HAWQ (incubating), Pivotal Greenplum Database, and PostgreSQL platforms.

MADlib also occupies a unique niche among data science and machine learning libraries: its SQL APIs allow it to work on a wide range of data stores and SQL engines, and SQL gives users a very common language to build from. Currently the toolkit provides algorithms for classification, regression, clustering, topic modeling, association rule mining, descriptive statistics, and validation, among others. More details can be found in the latest user guide.
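To make the data-parallel, shared-nothing idea a little more concrete, here is a minimal Python sketch of how a simple linear regression can be computed when rows are spread across independent segments. It is an illustration only and is not MADlib code: the names `State`, `transition`, `merge`, `final` and `fit`, as well as the sample data, are assumptions made up for this example. The point it tries to show is that each segment scans only its own rows and produces a small summary, and only those summaries need to be combined.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class State:
    """Sufficient statistics for fitting y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(s, row):
    """Fold one (x, y) row into a segment-local state."""
    x, y = row
    return State(s.n + 1, s.sx + x, s.sy + y, s.sxx + x * x, s.sxy + x * y)


def merge(a, b):
    """Combine the states of two segments by adding their sums."""
    return State(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def final(s):
    """Turn the merged state into (intercept, slope) by least squares."""
    denom = s.n * s.sxx - s.sx * s.sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (s.n * s.sxy - s.sx * s.sy) / denom
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


def fit(segments):
    """Each segment scans only its own rows; only small states are merged."""
    local_states = [reduce(transition, rows, State()) for rows in segments]
    return final(reduce(merge, local_states, State()))


# Hypothetical data lying on y = 1 + 2x, split across three "segments".
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
intercept, slope = fit(segments)
print(intercept, slope)
```

Because the merge step only adds up five numbers per segment, the amount of data exchanged between segments stays constant no matter how many rows each one holds. In a SQL engine the same shape would typically be expressed as a user-defined aggregate rather than as Python functions.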

The History of MADlib

MADlib grew out of discussions between database engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a paper from VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and computer scientists at Pivotal (formerly EMC/Greenplum).

The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the University of Wisconsin, and the University of Florida. The project was publicly documented in a paper at VLDB 2012. Today, MADlib has contributors from around the world, including both individuals and institutions.

MADlib was conceived from the outset as a free, open source library for all to use and contribute to. Since its inception, the community has steadily added new methods in the areas of mathematics, statistics, machine learning, and data transformation. Including the named examples mentioned earlier, the current library includes over 30 principal algorithms as well as many additional operators and utility functions.

The first platforms supported by MADlib were Greenplum Database and PostgreSQL. With the development of Pivotal HAWQ, the SQL-on-Hadoop technology, MADlib offered a new way to perform predictive analytics—on data sets stored on a Hadoop cluster.

Today, MADlib is in active development and is deployed on a wide variety of industry and academic projects across many different verticals. Steven Hillion, chief product officer at Alpine Data Labs, gave this perspective on MADlib’s history and future:

“I was involved in the early development and use of MADlib when I was at Greenplum, and I’ve watched it mature over the years to become a comprehensive, enterprise-grade tool for data scientists. We’ve used MADlib for many years at Alpine Data Labs, in healthcare, manufacturing, finance and many other applications. I’m really encouraged to see the Apache community now helping to ensure that MADlib continues to innovate while maintaining high levels of quality.”

Why The Apache Software Foundation?

The open source community behind MADlib felt that aligning itself with the ASF community, governance model, and infrastructure would allow the project to accelerate adoption and community growth. Also, given HAWQ’s trajectory of entering the ASF family as an incubator project itself, we felt that the best course of action for MADlib was to follow a similar path.

MADlib and HAWQ are complementary technologies since MADlib in-database analytic functions can run efficiently within the HAWQ execution engine. We expect that contributors to MADlib will be cognizant of the Apache HAWQ (incubating) project and may contribute to it as well. Collaboration between the two communities will make both projects more vibrant and advance the respective technologies in potentially novel directions.

Contributors may also look at the HAWQ project as a starting point for ports to other parallel database engines. Pivotal encourages this type of work as it would help to further realize the original cross-platform goal of MADlib as envisioned by its originators.

Given the high velocity of innovation happening in the underlying Hadoop ecosystem, any SQL-based predictive analytics technology that plays well in this ecosystem must be commensurately agile to keep up with the community. We strongly believe that, in the Big Data space, this agility can be optimally achieved through a vibrant, diverse, self-governed community—collectively innovating around a single codebase while at the same time cross-pollinating with various other data management communities. The ASF is the ideal place to meet those ambitious goals.

Bringing Value To The Enterprise

Most business executives, across all industries and departments, believe that a data-driven organization performs better, and a wide variety of research backs this up. In today’s world, this means two things. First, companies must centrally store the massive amounts of data they have captured from mobile, social, web, e-commerce, analytical, IoT, and other similar systems. Then, they must go beyond traditional business intelligence and ask data scientists to use intelligent algorithms, like those in MADlib, to help the organization make better decisions and take smarter actions.

From a technical perspective, enterprises today are seeing the value of landing very large quantities of data in Apache Hadoop® clusters with the goal of improving their products and processes. With the proliferation of increasingly sophisticated SQL-on-Hadoop technologies such as HAWQ, analysts can use the familiar SQL language to query this data at scale. A SQL-based interface effectively opens the door to Hadoop in the enterprise, and organizations do not need to re-train their teams on an unfamiliar programming language since SQL skills are ubiquitous.

Adding SQL-based predictive analytics like MADlib to the equation enables organizations to reason across large data sets without resorting to sampling, which has been a traditional approach when confronted with scale problems. Operating on all of the data with MADlib can result in more robust and accurate models.

For data scientists who are used to working in R, PivotalR also provides an R interface to MADlib functions.

Building Community

More than 25 initial contributors were listed on the proposal to Apache for entrance into incubating status, including representatives from Pivotal, Hortonworks, MapR, WANDisco and Barclays. This group will form a base from which to extend to the broader community. We invite anyone to come and collaborate on the codebase. Users and new contributors will be treated with respect and welcomed. Both software contributions and non-code contributions (documentation, events, community management, etc.) are valued.

At Pivotal, we enthusiastically look forward to working together with all future contributors to MADlib in order to advance the state of the art in scale-out data science tools.

Editor’s Note: Apache, Apache Hadoop, Hadoop and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.
