Google Summer of Code 2014 has wrapped up, and with it comes an update to Pivotal’s MADlib project, an open source toolset for big data machine learning in SQL. The codebase now includes the implementation of a new analytics algorithm.
French computer science student Maxence Ahlouche originally proposed implementing two clustering algorithms for MADlib: k-medoids and OPTICS. Over the course of GSoC, we changed the goals. We left out OPTICS, and Maxence proposed to refactor k-means and k-medoids to use the same code base. This cleaned up a lot of duplicate code and made everything more readable and easier to use.
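To illustrate why k-means and k-medoids lend themselves to a shared code base, here is a minimal Python sketch. It is not MADlib's actual code (MADlib runs inside the database, in SQL), and every name in it (`run_clustering`, `mean_center`, `medoid_center`, `sq_dist`) is hypothetical. It assumes squared Euclidean distance for both algorithms, although k-medoids can work with any dissimilarity measure. The point is that both algorithms can share one assign-and-update loop and differ only in how a cluster's center is recomputed.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_center(members: List[Point]) -> Point:
    """k-means update: the coordinate-wise mean of the cluster members."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def medoid_center(members: List[Point]) -> Point:
    """k-medoids update: the member with the smallest total distance to the others."""
    return min(members, key=lambda c: sum(sq_dist(c, p) for p in members))


def run_clustering(
    points: List[Point],
    initial_centers: List[Point],
    update: Callable[[List[Point]], Point],
    max_iter: int = 100,
) -> List[Point]:
    """Shared loop: assign points to the nearest center, then update the centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(tuple(p))
        # An empty cluster keeps its previous center.
        new_centers = [update(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers


points = [(0, 0), (0, 2), (10, 0), (10, 2), (10, 4)]
start = [(0, 0), (10, 0)]
kmeans_centers = run_clustering(points, start, mean_center)
kmedoids_centers = run_clustering(points, start, medoid_center)
```

Swapping `mean_center` for `medoid_center` is the only change between the two runs. The k-medoids result is always made up of actual data points, while the k-means centers may fall between them.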
Just in time for the deadline, the implementation for both the PostgreSQL version and the Greenplum/HAWQ version was finished, along with the proper tests and documentation. Maxence also stated that he will continue his work and implement the OPTICS algorithm. After all, that is what the Google Summer of Code program is for: bringing open source projects and students together and offering students an easy way to contribute to the project.
The code from Maxence will be audited by the Pivotal team, namely by Pivotal Senior Engineer Hai Qian. Then, it will be added to the MADlib code base in the upcoming release.
Pivotal wants to thank all participants:
Maxence Ahlouche: for the excellent work during the summer
Atri Sharma: for mentoring the project
Hai Qian: for extensive input and countless hints
Caleb Welton: for support from Pivotal
Andreas Scherbaum (me)
Special thanks go to the PostgreSQL Global Development Group for enabling us to participate in the GSoC program.
For those who are unfamiliar with MADlib, the project sits at the intersection of commercial efforts, academic research, and open source development. The project is built from the ground up to operate in distributed computing environments and massively parallel processing databases. With Pivotal Greenplum, the data can be operated on locally within a shared-nothing architecture. To date, the library supports methods such as classification, regression, clustering, topic modeling, association rule mining, descriptive statistics, validation, and more.
- Read more about MADlib or PostgreSQL
- Check out more details about the Google Summer of Code
- Find out more about parallel processing of MADlib algorithms on Pivotal Greenplum MPP Database or Apache Hadoop®, Pivotal HD, with HAWQ
- Read more blog articles on Pivotal Data Science or Big Data
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author
Andreas Scherbaum has been working with PostgreSQL since 1997. He is involved in several PostgreSQL-related community projects, is a member of the Board of Directors of the European PostgreSQL User Group, and wrote a PostgreSQL book (in German). Since 2011 he has been working for EMC/Greenplum/Pivotal, where he tackles very big databases.