Implementing Adaboost on MPP for Big Data Analytics

January 29, 2015 Regunathan Radhakrishnan

It is an exciting time to be working on machine learning applications, due to the ubiquitous availability of data, ease of access to large-scale distributed computing platforms, and the availability of machine learning libraries on these distributed computing platforms. For certain use cases, a data scientist may find that the machine learning method that they intend to use is already implemented (e.g., Random Forests, Decision Trees, Regression, LDA, etc.). However, if the problem at hand requires a machine learning method that is not available in the library, then the data scientist needs to quickly implement the method in a clever way.

As an example, let’s consider boosting, a general and provably effective method of producing a very accurate prediction rule by combining many less-accurate prediction rules. Adaboost, a popular machine learning method, is a specific instantiation of boosting algorithms introduced in 1995 by Freund and Schapire. Adaboost is handy in machine learning applications where you need a classifier with high accuracy, which can be easily interpreted to see which factors or rules are contributing towards predicting the outcome.

In this blog post, we’ll demonstrate how easy it is to implement Adaboost on Pivotal Greenplum Database (GPDB). GPDB has a massively parallel processing (MPP) architecture and is built for scalable Big Data analytics. Data scientists can benefit from the power of MADlib, a machine learning library that runs on GPDB. GPDB is also flexible enough to let data scientists implement a new method that is not already included in MADlib.

Editor’s Note: For more on new capabilities recently added, see our latest post on the new machine learning methods implemented in MADlib 1.7.

We implemented Adaboost using the same framework outlined here to develop a fraud detection model, as part of an engagement with a large financial organization. The model was able to detect fraudulent new members within a few days of account creation. Because each round of Adaboost fits a decision tree, we were also able to explain which rules, among those with large weights in the final classifier, contributed to flagging a particular user.

In the following sections, we will briefly describe what Adaboost is, and then describe the implementation itself.

What Is Adaboost?

Adaboost is a popular ensemble classifier that combines the output of several weak learners to obtain a strong classifier. The weak learners could be any other classifier, such as a decision stump, decision tree, logistic regression, SVM, etc. There are two requirements placed upon the weak learners:

- The accuracy of each weak learner should be better than random guessing for arbitrary, unknown distributions of the training data. For instance, in a binary classification problem, the training accuracy of the weak learner (the fraction of correctly classified examples) should be strictly greater than 0.5.
- The weak learner should be able to handle weighted training examples.

Given these constraints on the weak learners, Adaboost provides a framework for combining them into a final classifier whose accuracy is significantly higher than that of any single weak learner.
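The two requirements on weak learners can be checked numerically. The following is a minimal base-R sketch on made-up toy data; the labels, predictions, weights, and the helper name `weighted_accuracy` are all assumptions for illustration and are not part of MADlib or PivotalR.

```r
# Toy data (hypothetical): true labels, one weak learner's predictions,
# and two candidate weightings of the same six examples.
y    <- c(1, 1, 0, 0, 1, 0)
pred <- c(1, 0, 0, 0, 1, 1)
w_uniform <- rep(1 / 6, 6)
w_skewed  <- c(0.05, 0.40, 0.05, 0.05, 0.05, 0.40)  # most weight on the two misses

# Weighted accuracy: share of the total weight on correctly classified examples.
weighted_accuracy <- function(y, pred, w) {
  sum(w[pred == y]) / sum(w)
}

acc_uniform <- weighted_accuracy(y, pred, w_uniform)
acc_skewed  <- weighted_accuracy(y, pred, w_skewed)

# Requirement 1 asks for accuracy strictly above 0.5 under the current weights.
passes_uniform <- acc_uniform > 0.5
passes_skewed  <- acc_skewed > 0.5
```

The same predictions pass the check under uniform weights and fail it once the weight concentrates on the two mistakes, which is why the weak learner has to be refit against the current weights rather than reused unchanged.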

In each iteration, Adaboost tries to correct the errors that the previous model made on particular training examples. It does so by placing higher weights on the examples that have been difficult to classify, so that the next weak learner concentrates on them instead of ignoring them.

Here is the pseudo-code of the Adaboost learning algorithm:

Let us assume we have N training examples {(x₁, y₁), (x₂, y₂) … (x_i, y_i)… (x_N, y_N)} where x_i represents the feature vector for the i^th training example and y_i represents the corresponding label (0 or 1). Let us also initialize a set of weights, w_i, over the set of training examples to be 1/N, equal for each training example initially.

For each iteration t until T
- Step 1: learn a weak classifier (h_t) with current set of weights w_i
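- Step 2: compute the weighted training error (ε_t) of the classifier h_t
- Step 3: define α_t = 0.5 * ln((1 - ε_t) / ε_t)
- Step 4: increase the weights on the misclassified examples by a factor of e^(α_t) (in Freund and Schapire’s formulation, the weights on correctly classified examples are also decreased by a factor of e^(-α_t)) and renormalize the weights w_i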
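To make steps 2 to 4 concrete, here is a minimal base-R sketch of a single boosting round on toy data. The labels and predictions are invented for illustration. The weight update uses the symmetric form from Freund and Schapire, which multiplies misclassified examples by e^(α_t) and correctly classified ones by e^(-α_t) before renormalizing; treat that choice as an assumption of this sketch.

```r
# One boosting round on toy data (all values are illustrative).
y    <- c(1, 1, 0, 0, 1, 0, 1, 0, 1, 0)   # true labels
pred <- c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)   # predictions of the weak classifier h_t
w    <- rep(1 / length(y), length(y))     # initial weights, 1/N each

miss <- pred != y                         # TRUE for misclassified examples

# Step 2: weighted training error of h_t
eps <- sum(w[miss])

# Step 3: weight of h_t in the final classifier
alpha <- 0.5 * log((1 - eps) / eps)

# Step 4: raise the weights of misclassified examples, lower the rest,
# then renormalize so the weights sum to 1
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)

# Share of the new weight that sits on the examples h_t got wrong
miss_share <- sum(w_new[miss])
```

With this symmetric update, the misclassified examples end up carrying half of the total weight, so the next weak learner cannot do better than chance by repeating the same mistakes.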

During each iteration, the weights (w_i) are adjusted so that the next iteration puts more emphasis on the examples that were misclassified in the previous round. This ensures that complementary features (rules) are picked during the different rounds of Adaboost. As a result, Adaboost’s key benefit is that it can create a non-linear decision boundary for the classification problem at hand by combining the decision boundaries of the weak learners from different iterations.

The final Adaboost classifier is then given by H(x_i) = Σ_t α_t h_t(x_i), where h_t(x_i) is the decision of the t^th weak classifier and α_t is the weight given to that decision in the final classifier; the predicted label is obtained by thresholding this weighted sum. If the weak learners (the h_t) are decision trees, you can think of the final classifier as a combination of multiple rules, each carrying its own weight (α_t). For more details on the Adaboost algorithm, please refer to Freund and Schapire’s paper, A Short Introduction to Boosting.
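As a small illustration of the weighted vote, the base-R sketch below combines three hypothetical weak classifiers. The decisions and the α values are made up, and the decisions are coded as -1 / +1 (an assumption of this sketch) so that the weighted sum can be thresholded at zero before mapping back to 0/1 labels.

```r
# Hypothetical output of T = 3 boosting rounds: each row of `votes` holds one
# weak classifier's decisions on four examples, coded as -1 / +1.
votes <- rbind(
  c( 1,  1, -1, -1),
  c( 1, -1, -1,  1),
  c(-1,  1, -1,  1)
)
alpha <- c(0.9, 0.4, 0.3)   # illustrative classifier weights

# H(x_i) = sum over t of alpha_t * h_t(x_i): one score per example
score <- as.vector(alpha %*% votes)

# Threshold the score at zero and map back to 0/1 labels
label <- as.integer(score > 0)
```

For the fourth example, two of the three classifiers vote +1, but the single classifier with the largest weight votes -1 and wins, which shows that the final decision is a weighted vote and not a simple majority.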

Adaboost Implementation on GPDB

Now that we have explained the algorithm, let’s focus on implementing Adaboost. For this example, we’ll imagine training this classifier on a huge data set that contains millions of examples. Of the four steps mentioned above, step 1 is the most compute-intensive, because it requires fitting the weak classifier on all the training examples. We can overcome this bottleneck by using a machine learning library that runs in a distributed framework built for large-scale data, such as MADlib in GPDB, or in HAWQ running on Pivotal’s Apache Hadoop® distribution, Pivotal HD.

Below we will demonstrate how this can be done using PivotalR, which provides a convenient R front end to interact with Pivotal GPDB and Pivotal HD/HAWQ for Big Data analytics. PivotalR also provides access to MADlib’s scalable machine learning functions.

For those comfortable with writing SQL code, this could be done without the use of PivotalR. However, in this instance we will use PivotalR as the driver code to implement the Adaboost iterations. We will call one of the machine learning methods available in MADlib to fit a weak learner on the complete set of training examples. Figure 1 below illustrates the four steps of the Adaboost algorithm. Note that the computationally-intensive steps 1, 2 and 4 are run in-database using PivotalR’s ability to call MADlib, whereas step 3 is run locally on the R client. In step 4, where weights on the training examples are adjusted and renormalized, PivotalR doesn’t get a local copy of the weights to perform this operation but does everything in-database. If the training set has millions of examples, the weight vector is also of the same dimension, which can potentially slow down the weight update if it is downloaded to the R client.

Figure 1: Adaboost Implementation using PivotalR and Greenplum Database

The advantage of this implementation is that we don’t have to worry about memory limitations when working with large datasets, and we don’t have to draw a random sample. The in-database machine learning library, MADlib, allows us to build models on the entire dataset. We can still learn an Adaboost classifier using all of the examples, even if the number of training examples is on the order of tens of millions.

As mentioned earlier, an implementation is also possible that uses PL/pgSQL as the driver in place of PivotalR. In this case, step 3 is performed inside the PL/pgSQL function instead of being executed from an R client. The weak learner does not have to be a decision tree either: any other algorithm that accepts weighted training examples (e.g., weighted least squares) can be plugged into this implementation.

Key Takeaways and Applications

In this blog post, we have shown how you could implement Adaboost using the flexibility and MPP power of Pivotal’s Greenplum DB. We implemented this using PivotalR, which can call MADlib functions in GPDB. For an R user, this provides the best of both worlds: the ability to code algorithms in R, and to harness the power of MPP, without having to learn SQL. For SQL users, the implementation can easily be translated to run from a PL/pgSQL function.

Code Snippet

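## AdaBoost

## Adaboost function. The algorithm on pg. 339 of "The Elements of
## Statistical Learning (2nd)"
## NOTE: the function signatures and assignment right-hand sides of this
## listing are missing. Values marked "assumed" are inferred from the steps
## described in the text; "<lost>" marks values that cannot be inferred.
adaboost <- function(data, id, formula, maxit) {   # signature assumed
    # formula <- <lost>
    print(formula)
    # n   <- <lost>   (number of training examples)
    # dat <- <lost>   (training table in the database)

    #initialize parameters
    dat$w  <- 1 / n            # assumed: equal initial weights
    alpha  <- numeric(maxit)   # assumed
    ep     <- numeric(maxit)   # assumed
    models <- list()           # assumed

    #begin adaboost loop
    for (i in 1:maxit) {

        #step 1: train weak learner
        # train <- <lost>
        # dat   <- <lost>
        g <- tree(formula, train, id)   # assumed call to the helper below

        #step 2: measure performance on training data
        models[[i]] <- g                # assumed
        # p  <- <lost>   (predictions of g on the training data)
        # pp <- <lost>

        # ep[i] <- <lost>   (weighted training error)
        print(ep[i])

        #step 3: compute alpha
        # alpha[i] <- <lost>   (classifier weight computed from ep[i])

        #step 4: modify weights and renormalize
        # w     <- <lost>
        # dat   <- <lost>
        # dat$w <- <lost>

        # dat <- <lost>

    }

    return (list(models = models, alpha = alpha, error = ep))
}

tree <- function(formula, data, id) {   # signature assumed from the call below

    #call to madlib's decision tree function. Setting maxdepth = 1 gives a
    #decision stump as weak classifier. Can be set to any other value
    madlib.rpart(formula, data = data, id = names(id), weights = 'w', control = list(maxdepth = 1))
}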
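
The in-database listing needs a Greenplum connection and PivotalR to run. As a purely local, in-memory analogue of the same four-step loop, the following base-R sketch fits decision stumps on weighted data. It is an illustration under stated assumptions, not the code used in the engagement: the function names (`fit_stump`, `predict_stump`, `adaboost_local`, `predict_adaboost`) are hypothetical, labels are coded as -1 / +1, α_t uses the 0.5 * ln form from the text, and the weight update is the symmetric Freund and Schapire form.

```r
# Fit a decision stump on weighted data: choose the feature, threshold and
# direction with the lowest weighted error. Labels y are coded as -1 / +1.
fit_stump <- function(X, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(X))) {
    for (thr in unique(X[, j])) {
      for (dir in c(1, -1)) {
        pred <- ifelse(X[, j] <= thr, -dir, dir)
        err  <- sum(w[pred != y])
        if (err < best$err) {
          best <- list(feature = j, threshold = thr, direction = dir, err = err)
        }
      }
    }
  }
  best
}

predict_stump <- function(stump, X) {
  ifelse(X[, stump$feature] <= stump$threshold, -stump$direction, stump$direction)
}

adaboost_local <- function(X, y, maxit) {
  n      <- nrow(X)
  w      <- rep(1 / n, n)                   # equal initial weights
  models <- vector("list", maxit)
  alpha  <- numeric(maxit)
  ep     <- numeric(maxit)
  for (i in seq_len(maxit)) {
    g     <- fit_stump(X, y, w)             # step 1: train weak learner
    miss  <- predict_stump(g, X) != y       # step 2: weighted training error
    ep[i] <- sum(w[miss])
    e     <- min(max(ep[i], 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha[i] <- 0.5 * log((1 - e) / e)      # step 3: classifier weight
    w <- w * exp(ifelse(miss, alpha[i], -alpha[i]))  # step 4: update weights
    w <- w / sum(w)                         #         and renormalize
    models[[i]] <- g
  }
  list(models = models, alpha = alpha, error = ep)
}

predict_adaboost <- function(fit, X) {
  score <- rep(0, nrow(X))
  for (i in seq_along(fit$models)) {
    score <- score + fit$alpha[i] * predict_stump(fit$models[[i]], X)
  }
  ifelse(score > 0, 1, -1)
}

# Toy one-dimensional problem that no single stump can classify perfectly.
X <- matrix(0:9, ncol = 1)
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)

fit      <- adaboost_local(X, y, maxit = 3)
train_ok <- all(predict_adaboost(fit, X) == y)
```

On this toy problem each individual stump misclassifies at least three of the ten points, yet the weighted vote of three stumps classifies all of them correctly. In the in-database version, the stump fit, the error computation, and the weight update would run inside GPDB instead of on local vectors.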

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
