Implementing Adaboost on MPP for Big Data Analytics

January 29, 2015 Regunathan Radhakrishnan

featured-lotsoquationsIt is an exciting time to be working on machine learning applications, due to the ubiquitous availability of data, ease of access to large-scale distributed computing platforms, and the availability of machine learning libraries on these distributed computing platforms. For certain use cases, a data scientist may find that the machine learning method that they intend to use is already implemented (e.g., Random Forests, Decision Trees, Regression, LDA, etc.). However, if the problem at hand requires a machine learning method that is not available in the library, then the data scientist needs to quickly implement the method in a clever way.

As an example, let’s consider boosting, a general and provably effective method of producing a very accurate prediction rule by combining many less-accurate prediction rules. Adaboost, a popular machine learning method, is a specific instantiation of boosting algorithms introduced in 1995 by Freund and Schapire. Adaboost is handy in machine learning applications where you need a classifier with high accuracy, which can be easily interpreted to see which factors or rules are contributing towards predicting the outcome.

In this blog post, we’ll demonstrate how easy it is to implement Adaboost on Pivotal Greenplum Database (GPDB). GPDB has MPP Architecture and is built for scalable Big Data analytics. Data scientists can benefit from the power of MADlib, which runs on GPDB. Pivotal’s GPDB is flexible enough to allow data scientists to implement a new method that might not be already included in the library (MADlib).

Editor’s Note: For more on new capabilities recently added, see our latest post on the new machine learning methods implemented in MADlib 1.7.

We implemented Adaboost using the same framework outlined here to develop a fraud detection model, as part of an engagement with a large financial organization. The model was able to detect fraudulent new members few days after their account creation and we were also able to explain the rules with large weights in the final classifier, from the decision tree in each round of Adaboost, that contributed to flagging a particular user.

In the following sections, we will briefly describe what Adaboost is, and then describe the implementation itself.

What Is Adaboost?

Adaboost is a popular ensemble classifier that combines the output of several weak learners to obtain a strong classifier. The weak learners could be any other classifier, such as a decision stump, decision tree, logistic regression, SVM, etc. There are two requirements placed upon the weak learners:

  • The accuracy of these weak learners should at least be better than random guessing for arbitrary, unknown distributions of the training data. For instance, in a binary classification problem, the training data accuracy of the weak learner (which is the percentage of correctly classified examples) should be strictly greater than 0.5.
  • The weak learner should be able to handle weighted training examples. Given these constraints on the weak learners, Adaboost provides a framework to combine these weak learners to obtain a final classifier whose accuracy is significantly higher than the accuracy of any single model, the weak learners.

In each iteration, Adaboost attempts to improve upon its errors for particular examples in the training set by minimizing the errors for those in the previous model. In each iteration, the weak learners place higher weights on training examples that have been particularly difficult, allowing it to focus on all of the data, rather than ignoring a subset.

Here is the pseudo-code of the Adaboost learning algorithm:

Let us assume we have N training examples {(x1, y1), (x2, y2) … (xi, yi)… (xN, yN)} where xi represents the feature vector for the ith training example and yi represents the corresponding label (0 or 1). Let us also initialize a set of weights, wi, over the set of training examples to be 1/N, equal for each training example initially.

  • For each iteration t until T
    • Step 1: learn a weak classifier (ht) with current set of weights wi
    • Step 2: compute the training error (εt) of the classifier ht
    • Step 3: define αt = 0.5*(ln(1- εt)/ εt)
    • Step 4: increase the weights on the misclassified examples by a factor of eαt and renormalize the weights wi

During each iteration, the set of weights (wi) are adjusted in such a way that in the next iteration there is more emphasis on mis-classified examples in the previous round. This ensures that complementary features (rules) are picked during the different rounds of Adaboost. As a result, Adaboost’s key benefit is that it can create a non-linear decision boundary for the classification problem at hand by combining the decision boundaries of these weak learners from different iterations.

The final Adaboost classifier is then given by H(xi) = Σt αt ht(xi) where ht(xi) is the decision of the tth weak classifier and αt is the corresponding weight given to that decision in the final classifier. If the weak learners (ht s) are decision trees, then you can intuitively imagine the final classifier as a way to combine multiple rules with a certain weight (αt) for each of the rules. For more details on the Adaboost algorithm, please refer to Freund’s Introduction to Boosting paper.

Adaboost Implementation on GPDB

Now that we have explained the algorithm, let’s focus on implementing Adaboost. For this example, we’ll imagine training this classifier on a huge data set which contains millions of examples. Of the four steps mentioned above, step 1 is the most compute-intensive step, requiring learning the weak classifier on all the training examples. We can overcome this limitation by using a powerful machine learning library in a distributed framework which can handle and compute large-scale data, such as MADlib in GPDB, or HAWQ running on Pivotal’s Apache Hadoop® distribution, Pivotal HD.

Below we will demonstrate how this can be done using PivotalR, which provides a convenient R front end to interact with Pivotal GPDB and Pivotal HD/HAWQ for Big Data analytics. PivotaR also provides access to MADlib’s scalable machine learning functions.

For those comfortable with writing SQL code, this could be done without the use of PivotalR. However, in this instance we will use PivotalR as the driver code to implement the Adaboost iterations. We will call one of the machine learning methods available in MADlib to fit a weak learner on the complete set of training examples. Figure 1 below illustrates the four steps of the Adaboost algorithm. Note that the computationally-intensive steps 1, 2 and 4 are run in-database using PivotalR’s ability to call MADlib, whereas step 3 is run locally on the R client. In step 4, where weights on the training examples are adjusted and renormalized, PivotalR doesn’t get a local copy of the weights to perform this operation but does everything in-database. If the training set has millions of examples, the weight vector is also of the same dimension, which can potentially slow down the weight update if it is downloaded to the R client.

Figure 1: Adaboost Implementation using PivotalR and GPDB

Figure 1: Adaboost Implementation using PivotalR and Greenplum Database

The advantage of this implementation is that we don’t have to be concerned about memory limitations when we are working with large datasets and don’t have to create a random sample. The in-database machine learning library, MADlib, allows us to build models on the entire dataset. We can still learn an Adaboost classifier using all of the examples, even if the number of training examples is in the order of tens of millions.

As mentioned earlier, an implementation is also possible which uses PL-PGSQL as a driver for PivotalR. In this case, step 3 is performed inside the PL-PGSQL script instead of being executed from an R client. With the availability of other algorithms allowing the input of weighted training examples, the use of models other than decision trees (e.g., weighted least squares) is possible using this implementation.

Key Takeways and Applications

In this blog post, we have shown how you could implement Adaboost using the flexibility and MPP power of Pivotal’s Greenplum DB. We implemented this using PivotalR , which can call MADlib functions in GPDB. For an R user, this provides the best of both worlds: the ability to code algorithms in R, and to harness the power of MPP, without having to learn SQL. For SQL users, the implementation can easily be translated to run from a PL-PGSQL function.

Code Snippet

## AdaBoost

## Adaboost function. The algorithm on pg. 339 of “The Elements of
## Statistical Learning (2nd)”

adaboost {
formula print(formula)
n dat
#initialize parameters
dat$w alpha ep models

#begin adaboost loop
for (i in 1:maxit) {

#step 1: train weak learner
train dat g

#step 2: measure performance on training data
models[[i]] p pp

ep[i] print(ep[i])

#step 3: compute alpha
alpha[i]

#step 4: modify weights and renormalize
w dat dat$w

dat

}

return (list(models = models, alpha = alpha, error = ep))
}

tree {

#call to madlib’s decision tree function…setting maxdepth=1 gives a decision stump as weak classifier. Can be set to any other value
madlib.rpart(formula, data = data, id = names(id), weights = 'w', control = list(maxdepth = 1))
}

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Biography

More Content by Regunathan Radhakrishnan
Previous
This Month in Data Science: January 2015
This Month in Data Science: January 2015

As thought leaders, practitioners, and journalists looked forward to the year ahead in January, they identi...

Next
Pivotal Opens Chicago Office
Pivotal Opens Chicago Office

Announcing that the Pivotal Chicago office is now open! To celebrate Pivotal’s new accessibility for local ...

Enter curious. Exit smarter.

Register Now