This year marks the 10th year of Google Summer of Code (GSoC). Since its inception, over 7500 students have developed over 50 million lines of code by working with over 440 open source projects and 7000 open source mentors from 100 countries.
This summer, the proposal of French computer science student Maxence Ahlouche was chosen from 6313 proposals. He is spending 12 weeks, from mid-May to the end of August, developing two data science algorithms for MADlib, Pivotal’s open-source library of big data analytics and machine learning algorithms supporting PostgreSQL. The algorithms will also run on PostgreSQL-compatible, massively parallel database services such as Pivotal Greenplum and Pivotal HD with HAWQ, the Pivotal distribution of Apache Hadoop®.
Only 28 students were selected from France, and the country ranked 13th in terms of student participants. The five countries with the most student participants were India (401), the United States (161), Germany (78), Sri Lanka (54), and the Russian Federation (51).
How Does Google Summer of Code Work?
For PostgreSQL, the four accepted GSoC projects were index-only scans for GiST, changing unlogged to logged tables, supporting KNN for SP-GiST, and implementing clustering algorithms in MADlib. The features to implement are decided solely by the projects, and the mentoring is done by well-known project members. In this case, Maxence is guided by former GSoC student Atri Sharma, Pivotal Senior Engineer Hai Qian, and EMC Architect and Advisor Andreas Scherbaum.
The aim of GSoC is to help “recruit” students as new members of open source projects and establish a long-term relationship, possibly beyond the current project. As part of the project, the students become familiar with the code base, infrastructure, and organization behind the open source project. During the process, students contribute real, working code to the fast-growing, dynamic, disruptive world of open source software. Later in the summer, students and mentors are invited to a conference on the Google campus in Mountain View.
The Work—Developing Cluster Analysis Tools for MADlib
A common task in data science is clustering or grouping data into sets by similarity. This type of analysis is performed in use cases with gene sequencing, bioinformatics, medical imaging, recommendation engines, search results, data mining, machine learning, pattern recognition, image analysis, information retrieval, robotics, geology, and many other areas. Maxence is developing the k-medoids and the OPTICS clustering algorithms as part of the MADlib open source project.
MADlib provides an open-source framework that separates machine learning logic from database-specific implementation details. It lets the computation run locally within the database, where the data resides, and uses massively parallel processing (MPP) techniques, similar to MapReduce, for parallelism and scale. It features a toolkit of algorithms for classification, regression, clustering, topic modeling, rule mining, descriptive statistics, validation, time series analysis, and other data science techniques.
The GSoC project is split in two parts:
- Implementing the k-medoids algorithm, an interesting algorithm for noisy datasets and related to the already implemented k-means algorithm.
- Implementing the OPTICS (ordering points to identify the clustering structure) algorithm to identify density-based clusters in spatial data.
Both sub-projects will come with the necessary code, tests, and documentation. In addition, Maxence will remove duplicate code from the two new sub-projects and optionally from other MADlib code.
More About The Clustering Algorithms k-medoids and OPTICS
The k-medoids algorithm is similar to the well-known k-means algorithm: it also breaks a data set up into groups called partitions and aims to minimize the distance between each point and the center of its cluster. Unlike k-means, the k-medoids algorithm uses actual data points (medoids) as cluster centers. This makes the result more robust to noise and outliers, but it also makes the algorithm more computationally intensive.
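To make the idea concrete, here is a minimal sketch of k-medoids in plain Python. It is not MADlib's implementation or API; it uses a simple alternating scheme (assign points to the nearest medoid, then pick the best medoid within each cluster) rather than any particular published variant. The function names `assign` and `k_medoids`, the Euclidean distance, and the naive initialization with the first k points are all assumptions made for illustration.

```python
import math


def assign(points, medoids):
    """Group point indices by their nearest medoid (medoids are point indices)."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, max_iter=100):
    """Illustrative alternating k-medoids; returns (medoid indices, clusters)."""
    # Assumption: naive initialization with the first k points.
    medoids = list(range(k))
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                # Can only happen with coincident points; keep the old medoid.
                new_medoids.append(m)
                continue
            # The new medoid is the member with the smallest total distance
            # to all other members, so it is always an actual data point.
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    return medoids, assign(points, medoids)


# Two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
medoids, clusters = k_medoids(points, k=2)
```

The update step compares every candidate in a cluster against every other member, which is where the extra computational cost over k-means comes from: k-means only needs to average the coordinates.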
OPTICS tries to find density-based clusters in spatial data sets. In contrast to its predecessors, OPTICS is able to identify meaningful clusters in sets of varying density. The clustering problem is solved by ordering the points linearly so that spatially closest points become neighbors in the ordering.
The project’s progress will be documented and discussed on the MADlib mailing list.
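The ordering step can be sketched as follows. This is a simplified, quadratic-time illustration in plain Python, not MADlib code: it computes only the point ordering and the reachability distance of each point, and leaves out the extraction of clusters from that ordering. The function name `optics_order` is hypothetical, Euclidean distance is assumed, and the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size, counting the point itself) follow the usual OPTICS parameters rather than anything stated in this article.

```python
import heapq
import math


def optics_order(points, eps, min_pts):
    """Return (ordering of point indices, reachability distance per point)."""
    n = len(points)
    reach = [math.inf] * n
    processed = [False] * n
    order = []

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        # Distance to the min_pts-th closest neighbor, or None if i is not a core point.
        if len(nbrs) < min_pts:
            return None
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    def update(seeds, i, nbrs, core):
        # Lower the reachability of unprocessed neighbors reachable from i.
        for j in nbrs:
            if processed[j]:
                continue
            new_reach = max(core, math.dist(points[i], points[j]))
            if new_reach < reach[j]:
                reach[j] = new_reach
                heapq.heappush(seeds, (new_reach, j))  # older entries become stale

    def expand(i, seeds):
        processed[i] = True
        order.append(i)
        nbrs = neighbors(i)
        core = core_distance(i, nbrs)
        if core is not None:
            update(seeds, i, nbrs, core)

    for start in range(n):
        if processed[start]:
            continue
        seeds = []
        expand(start, seeds)
        while seeds:
            _, j = heapq.heappop(seeds)
            if not processed[j]:  # skip stale heap entries
                expand(j, seeds)
    return order, reach


# A sparse group (spacing 1) followed by a dense group (spacing 0.5).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (20, 0), (20.5, 0), (21, 0), (21.5, 0)]
order, reach = optics_order(points, eps=5, min_pts=2)
```

Plotting `reach` in the order given by `order` yields a reachability plot: each cluster appears as a valley, and clusters of different density show up as valleys of different depth, which is how OPTICS copes with varying density.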
- Read more about MADlib or PostgreSQL
- Check out more details about the Google Summer of Code
- Find out more about parallel processing of MADlib algorithms on Pivotal Greenplum MPP Database or Pivotal’s Hadoop® Distribution, Pivotal HD, with HAWQ
- Get more info on the algorithms mentioned in this article: k-means, k-medoids, and OPTICS
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author
Andreas Scherbaum has been working with PostgreSQL since 1997. He is involved in several PostgreSQL-related community projects, is a member of the Board of Directors of the European PostgreSQL User Group, and also wrote a PostgreSQL book (in German). Since 2011 he has been working for EMC/Greenplum/Pivotal, where he tackles very big databases.