Joint work provided by Xiaocheng Tang and Frank McQuillan.
Support vector machines are an important family of algorithms in machine learning and data science, and they are now available in the newest release of the open source Apache MADlib (incubating) library.
SVMs have found widespread use in such diverse domains as astrophysics, bioinformatics, and text categorization. For example, biodiversity researchers want to classify animals captured by remote cameras so they can understand the population and health of various species in the wild. To do so, features can be extracted from digital images using common methods such as histogram of gradients (HOG) and scale invariant feature transform (SIFT). A model can be trained using SVM on a large dataset of labeled (known) animal images using the extracted features. When an unknown animal image from a remote camera is input to the model, it can determine if the image is in the same class as the training set or not (e.g., mountain lion or not).
Apache MADlib (incubating) is a SQL-based, open source library for scalable in-database analytics that supports Greenplum Database, Pivotal HDB and PostgreSQL. The library offers data scientists numerous distributed implementations of mathematical, statistical and machine learning algorithms for structured and unstructured data. These algorithms are used by data scientists to solve real-world problems across a wide variety of domains.
In a previous post, we described new features in the recent MADlib 1.9 release, with a focus on path functions.
Introduction To Support Vector Machines
Support Vector Machines (SVMs) are supervised machine learning algorithms for regression and classification tasks, although they are more commonly used for classification.
SVM models have two particularly desirable features: robustness in the presence of noisy data and applicability to a variety of data configurations. At its core, an SVM model is a hyperplane separating two distinct classes of data (in the case of classification problems). For cases that are linearly separable, the distance between the hyperplane and the nearest training data points is maximized. Vectors that lie on the margin are called support vectors (Figure 1).
Figure 1: Separating Plane for Linearly Separable Case
In reality, it is often not possible to cleanly segregate two classes using a straight line, since some of the members of one class may lie in the territory of another class, due to outliers in the data set. In this case, SVM will find a hyperplane that is optimal in the sense that it will maximize the margin while minimizing loss in the training data.
Training SVM At Scale
Training SVM boils down to solving a nonlinear programming problem. Similar to other problems in nonlinear programming, SVM adopts the primal and dual formulations, and solving these two formulations results in equivalent solutions due to strong duality in SVM. There are different trade-offs between the primal and the dual, and as we will talk about later, the dual can be easily generalized to kernel machine using the kernel trick, which many SVM solvers exploit.
Back in the days when problem size was in the order of hundreds or thousands, SVM could be trained in the blink of an eye using the interior point method which is available in many off-the-shelf nonlinear solvers. Later, as problem sizes grew, these general solvers become insufficient due to high iteration complexity and prohibitive memory consumption.
Then LIBSVM came along, and soon became arguably the most popular specialized solver for SVM. Under the hood, LIBSVM optimizes the dual formulation in a block coordinate descent fashion which avoids storing the full dense kernel matrix in memory. LIBSVM scales SVM to problem size of hundreds of thousands and is still widely used; examples are the e1071 package in R and scikit-learn in Python.
In MADlib, the problems we face often contains millions or billions of data points stored in a distributed computing platform such as Greenplum database or Pivotal HDB. LIBSVM, as a fundamentally single node algorithm, fails to scale in this case due to high communication costs demanded by the block coordinate descent update.
In order to train SVM beyond a single machine, we have to rethink the optimization algorithm from scratch. In doing so, we make two key design decisions:
- We directly work with the primal formulation. The number of variables in this case is the dimension of the data points, which is independent of the rows in the data. For example, given a training data table of one billion rows and one thousand columns, the primal problem only needs to deal with one thousand variables, whereas the dual problem needs to deal with the number of support vectors, which could be proportional to the total 1B row data size.
- We distribute the model to each segment and update the model asynchronously, using only local data points within the segment. This is in theory equivalent to applying parallel block coordinate descent in the dual space. The key to make sure the parallel updates converge is to take the convex combination of models from different segments in a synchronous step. One way to determine the coefficients in the combination is to use the proportion of local segment data size to total data size.
Figure 2 below illustrates the general algorithmic scheme in our implementation. We directly solve the primal problem with the master coordinating updates with the segments. Each segment updates its model (m1, m2, …) asynchronously using stochastic subgradient descent (SGD), after which the master aggregates the model synchronously after all segments have completed a pass. Then, the master broadcasts the updated model to each segment for the next iteration. This process is repeated until convergence.
Figure 2: Master Aggregates Models from Segments and Broadcasts Updates
Generalizing to Nonlinear Kernels
There are many problems when a linear hyperplane is not adequate to separate the data due to the fundamental characteristics of the data itself. In such cases it is desirable to have a decision boundary that is not linear. MADlib currently supports polynomial and Gaussian (radial basis function) kernels, which are two of the most common kernels known to cover a wide variety of practical use cases.
Unlike the dual formulation used by LIBSVM, generalizing the primal problem to kernels is not straightforward. Instead, we use random feature maps to approximate the kernel. This embeds the data into a finite dimensional feature space so that the inner product in the transformed space approximates the inner product in the kernel space. In other words, a linear SVM trained in the transformed space can be treated as a kernel SVM in the original space. Hence the approach we describe above to train linear SVM can be readily applied.
Compared to solving the dual with kernel, one advantage of using random feature maps is that testing can be done much more efficiently with a lower memory footprint, because we no longer need to store all the support vectors and compute the kernel product with each one. In fact, we just need to apply the same transformation once to the test data, and the rest will be the same as testing with linear SVM. The cost of the transformation depends on the kernel that has been used, but oftentimes it can be implemented efficiently in distributed fashion and only requires one pass through the whole dataset.
For example, the main cost of approximating a Gaussian kernel involves multiplying the data table by a random matrix. The random matrix is independent of data size and often small enough that it can be efficiently broadcast to each tuple of the data table. Hence the matrix multiplication can be decomposed into a series of matrix-vector products that can be done independently in parallel.
Performance and Scalability
Figures 3a-c show some results of the run-time performance of the new MADlib SVM module for training and prediction*. These results are for classification, but the performance for regression is similar. Note linear scalability with number of rows and number of features. Also note the efficiency of prediction algorithm.
Figure 3a: SVM Training Time for 1M, 10M, 100M Rows (100 Features, 100 Components**)
Figure 3b: SVM Training Time for 10, 100, 1000 Features (10M Rows, 100 Components)
Figure 3c: SVM Prediction Time for 1M, 10M, 100M Rows (100 Features, 100 Components)
* Using a Pivotal Data Computing Appliance (DCA) half-rack for GPDB 126.96.36.199 with 8 nodes and 6 segments per node.
** Number of components is dimensionality in transformed space.
- Read the MADlib 1.9 release notes, download the source code or binaries, or join the user forum
- Read the MADlib SVM user documentation
- Find out more about Greenplum Database or Apache HAWQ/Pivotal HDB
- Read other articles from Pivotal Data Scientists
About the AuthorMore Content by Xiaocheng Tang