Pivotal's Google Summer of Code 2014: Implementing Clustering Algorithms in MADlib

June 5, 2014 Andreas Scherbaum

This year marks the 10th year of Google Summer of Code (GSoC). Since its inception, over 7500 students have developed over 50 million lines of code by working with over 440 open source projects and 7000 open source mentors from 100 countries.

This summer, the proposal of French computer science student Maxence Ahlouche was chosen out of 6313 proposals. He is spending 12 weeks, from mid-May to the end of August, developing two data science algorithms for MADlib, Pivotal’s open-source library of big data analytics and machine learning algorithms supporting PostgreSQL. The algorithms will also run on PostgreSQL-compatible, massively parallel database services like Pivotal Greenplum and the Pivotal distribution of Apache Hadoop®, Pivotal HD with HAWQ.

For GSoC 2014, Maxence was selected along with 1300 other students who will collectively work with 190 of the world’s top open source organizations, including The Apache Software Foundation, Ceph, CERN, Clojure, Debian, Drupal, FreeBSD, Git, Gnome, GNU, Google, Groovy, Haskell, Mozilla, openSUSE, phpMyAdmin, Python, R, PostgreSQL, Ruby on Rails, The Eclipse Foundation, The Fedora Project, The Linux Foundation, Twitter, WordPress, and Xen. The student efforts cover engineering software for web crawlers, in-memory data grids, JavaScript libraries, aggressive compilers, porting, integration, cryptography, semantics, self-tuning optimizers, speech recognition, computer vision, robotics, fuzzy visualization, and much more.

Only 28 students were selected from France, and the country ranked 13th in terms of student participants. The five countries with the most student participants were India (401), the United States (161), Germany (78), Sri Lanka (54), and the Russian Federation (51).

How Does Google Summer of Code Work?

For PostgreSQL, the four accepted GSoC projects were index-only scans for GIST, changing unlogged to logged tables, supporting KNN for SP-GIST, and implementing clustering algorithms in MADlib. In GSoC, the participating projects alone decide which new features are implemented, and the mentoring is done by well-known project members. In this case, former GSoC student Atri Sharma, Pivotal Senior Engineer Hai Qian, and EMC Architect and Advisor Andreas Scherbaum are guiding Maxence.

The aim of GSoC is to help “recruit” students as new members of open source projects and establish a long-term relationship, possibly beyond the current project. As part of the project, the students become familiar with the code base, infrastructure, and organization behind the open source project. During the process, students contribute real, working code to the fast-growing, dynamic, disruptive world of open source software. Later in the summer, students and mentors are invited to a conference on the Google campus in Mountain View.

The Work—Developing Cluster Analysis Tools for MADlib

A common task in data science is clustering or grouping data into sets by similarity. This type of analysis is performed in use cases with gene sequencing, bioinformatics, medical imaging, recommendation engines, search results, data mining, machine learning, pattern recognition, image analysis, information retrieval, robotics, geology, and many other areas. Maxence is developing the k-medoids and the OPTICS clustering algorithms as part of the MADlib open source project.

MADlib provides an open-source framework for separating machine learning logic from database-specific implementation details. It allows computation to run locally within the database, next to the data, and uses massively parallel processing (MPP) techniques, similar to MapReduce, for parallelism and scale. It features a toolkit of algorithms for classification, regression, clustering, topic modeling, rule mining, descriptive statistics, validation, time series analysis, and other data science techniques.

The GSoC project is split in two parts:

  1. Implementing the k-medoids algorithm, which is well suited to noisy datasets and is related to the already implemented k-means algorithm.
  2. Implementing the OPTICS (ordering points to identify the clustering structure) algorithm to identify density-based clusters in spatial data.

Both sub-projects will come with the necessary code, tests, and documentation. In addition, Maxence will remove duplicate code from the two new sub-projects and optionally from other MADlib code.

More About The Clustering Algorithms k-medoids and OPTICS

The k-medoids algorithm is similar to the well-known k-means algorithm: it also breaks up a data set into groups (partitions, or clusters) and aims to minimize the distance of each point to the center of its cluster. Unlike k-means, the k-medoids algorithm uses actual data points, called medoids, as cluster centers. This makes the result more robust to noise and outliers, but it also makes the algorithm more computationally intensive.
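
To make the idea concrete, here is a minimal sketch of k-medoids in plain Python. It is not MADlib's implementation or API: the function names `assign` and `k_medoids`, the parameters, and the sample points are all hypothetical, and the sketch uses a simple alternating scheme (assign points, then re-pick each medoid) rather than any particular published variant. It assumes the points are distinct and small enough to compare pairwise in memory.

```python
import math
import random


def assign(points, medoids, dist):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, dist=math.dist, max_iter=100, seed=0):
    """Alternate between assigning points and re-picking medoids (hypothetical helper)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # start from k random data points
    for _ in range(max_iter):
        clusters = assign(points, medoids, dist)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can happen with duplicate points; keep the old medoid
                new_medoids.append(m)
                continue
            # The medoid is the member with the smallest total distance to its cluster.
            best = min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break  # no medoid changed, so the assignment is stable
        medoids = new_medoids
    medoids = sorted(medoids)
    return medoids, assign(points, medoids, dist)


# Two well-separated groups of three points each (made-up sample data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
medoids, clusters = k_medoids(points, k=2)
print(medoids)   # indices of the data points chosen as cluster centers
print(clusters)  # medoid index -> indices of the points in its cluster
```

Because every candidate medoid has to be compared against all members of its cluster, the update step is quadratic in the cluster size, which is where the extra computational cost compared to k-means comes from.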

OPTICS tries to find density-based clusters in spatial data sets. In contrast to its predecessors, OPTICS is able to identify meaningful clusters in sets of varying density. It does so by ordering the points linearly, so that points that are spatially close to each other become neighbors in the ordering.

The project’s progress will be documented and discussed on the MADlib mailing list.
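
Returning to OPTICS for a moment: the ordering idea described above can be illustrated with a small sketch in plain Python. Again, this is not MADlib code. The names `optics` and `split_by_reachability`, the parameters `eps` and `min_pts`, and the one-dimensional sample data are assumptions made for illustration, the neighbor search is a naive full scan, and the cluster extraction at the end is a deliberately simplified threshold cut.

```python
import heapq
import math


def optics(points, eps, min_pts):
    """Return (ordering, reach): the visit order and each point's reachability distance."""
    n = len(points)
    reach = [math.inf] * n
    processed = [False] * n
    ordering = []

    def dist(i, j):
        return math.dist(points[i], points[j])

    for start in range(n):
        if processed[start]:
            continue
        seeds = [(reach[start], start)]  # min-heap of (reachability, point index)
        while seeds:
            _, i = heapq.heappop(seeds)
            if processed[i]:
                continue  # stale heap entry; the point was already visited
            processed[i] = True
            ordering.append(i)
            # Naive neighbor scan; the point itself counts as a neighbor.
            nbrs = [j for j in range(n) if dist(i, j) <= eps]
            if len(nbrs) < min_pts:
                continue  # not a core point, so it does not expand the cluster
            core = sorted(dist(i, j) for j in nbrs)[min_pts - 1]
            for j in nbrs:
                if not processed[j]:
                    new_reach = max(core, dist(i, j))
                    if new_reach < reach[j]:
                        reach[j] = new_reach
                        heapq.heappush(seeds, (new_reach, j))
    return ordering, reach


def split_by_reachability(ordering, reach, threshold):
    """Simplified extraction: start a new group whenever reachability jumps above threshold."""
    groups = []
    for i in ordering:
        if reach[i] > threshold or not groups:
            groups.append([])
        groups[-1].append(i)
    return groups


# Two interleaved 1-D groups around 0..2 and 10..12 (made-up sample data).
points = [(0.0,), (10.0,), (1.0,), (11.0,), (2.0,), (12.0,)]
ordering, reach = optics(points, eps=3.0, min_pts=2)
groups = split_by_reachability(ordering, reach, threshold=1.5)
print(ordering)  # [0, 2, 4, 1, 3, 5]
print(groups)    # [[0, 2, 4], [1, 3, 5]]
```

The ordering places the points of each dense region next to each other, and a jump in reachability marks the boundary between regions. Choosing different thresholds on the same ordering is what lets this kind of approach cope with clusters of varying density.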

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Andreas Scherbaum

Andreas Scherbaum has been working with PostgreSQL since 1997. He is involved in several PostgreSQL-related community projects, is a member of the Board of Directors of the European PostgreSQL User Group, and wrote a PostgreSQL book (in German). Since 2011 he has been working for EMC/Greenplum/Pivotal, where he tackles very big databases.
