MADlib 1.7 is now available!
MADlib is a SQL-based open source library for scalable in-database analytics that supports PostgreSQL, Pivotal Greenplum Database, and Pivotal HAWQ. The library gives data scientists a ready-to-use set of algorithms that accelerates time to insight. It offers more than 30 data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. Data scientists use these algorithms to solve complex problems across a wide variety of domains, from financial services to healthcare to academic research.
MADlib 1.7 adds the following capabilities:
- Generalized linear models—a class of supervised learning algorithms that is a generalization of linear regression
- Decision trees (completely new and improved implementation)—a supervised learning method that predicts the value of a target variable based on several input variables and can run up to 40 times faster than the previous version
- Random forest (completely new and improved implementation)—an ensemble method in which each classifier is a tree modeled on some combination of the input data; the implementation includes variable importance metrics and the ability to explore each tree in the forest independently
Let’s take a closer look at each of these.
Generalized Linear Models
The Generalized Linear Model (GLM) is a class of supervised learning algorithms. As its name suggests, it is a generalization of linear regression. GLM relates a linear predictor (i.e., a linear combination of explanatory variables) to a response variable, with a link function expressing the relationship between the two. The appropriate choice of family and link function depends on the distribution of the data and the nature of the response variable (continuous, binary, count, etc.).
MADlib 1.7 supports the common families of distributions—such as Gaussian, binomial, Poisson, gamma, and inverse Gaussian—along with their standard link functions.
For example, the number of items bought by customers in a grocery store would typically be modeled with a Poisson distribution and a log link function. Number of items would be the response variable, and explanatory variables could be customer demographics, macroeconomic factors, and promotions included to build the Poisson regression.
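A Poisson model like this could be trained with MADlib's glm() function. The table and column names below are hypothetical, used only to illustrate the call:

```sql
-- Hypothetical table 'grocery_trips' with columns:
-- id, num_items, age, income, promo_count
SELECT madlib.glm(
    'grocery_trips',                       -- Source data table
    'glm_poisson_model',                   -- Table to store the model
    'num_items',                           -- Response variable (a count)
    'ARRAY[1, age, income, promo_count]',  -- Intercept plus explanatory variables
    'family=poisson, link=log'             -- Distribution family and link function
);
```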
In addition to these distribution families, MADlib 1.7 adds two other new regression algorithms: multinomial regression and ordinal regression.
Multinomial regression is a classification method that generalizes binomial regression to multiclass problems having more than two possible discrete outcomes. It is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables which may be real-valued, binary-valued, categorical-valued, etc.
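In SQL, a multinomial model can be trained with MADlib's multinom() function. The table and columns below are hypothetical, and the reference-category argument is left at its default:

```sql
-- Hypothetical table 'customers' with columns:
-- id, product_choice (categorical), age, income
SELECT madlib.multinom(
    'customers',              -- Source data table
    'multinom_model',         -- Table to store the model
    'product_choice',         -- Categorical response variable
    'ARRAY[1, age, income]',  -- Intercept plus explanatory variables
    NULL,                     -- Reference category (default)
    'logit'                   -- Link function
);
```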
Ordinal regression is a type of regression analysis used for predicting an ordinal variable where a variable’s value exists on an arbitrary scale and only the relative ordering between different values is significant. The two most common types of ordinal regression models are ordered logit, which applies to data that meets the proportional odds assumption, and ordered probit. Both types are included in MADlib 1.7.
An example of ordinal regression is Yelp data on restaurants and their crowd-sourced ratings. A restaurant’s Yelp rating is an ordered variable, ranging from 1 to 5. We could round a restaurant’s rating to its nearest 0.5 and set it as the response variable in an ordered probit model. Restaurant characteristics, including food type, price range, location, etc., and information on those who rated the restaurant, such as average rating, number of reviews submitted, etc., could be added as explanatory variables, with a probit link function.
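A sketch of the corresponding call with MADlib's ordinal() function, assuming a hypothetical table of restaurants (the category-order argument specifies the ordering of the response values, and the thresholds play the role of intercepts):

```sql
-- Hypothetical table 'restaurants' with columns:
-- id, rating (ordinal), price_range, num_reviews
SELECT madlib.ordinal(
    'restaurants',                      -- Source data table
    'ordinal_model',                    -- Table to store the model
    'rating',                           -- Ordinal response variable
    'ARRAY[price_range, num_reviews]',  -- Explanatory variables
    '1<2<3<4<5',                        -- Ordering of response categories
    'probit'                            -- Link function (ordered probit)
);
```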
The parallel nature of MADlib’s algorithm design is demonstrated in the chart below. For ordinal regression with a probit link function, similar to the Yelp restaurant rating example above, execution time scales linearly with the number of rows in the training set:
Using a Pivotal Data Computing Appliance (DCA) half-rack for Greenplum Database and a DCA half-rack for HAWQ, with 8 nodes and 6 segments per node.
Decision Trees
Decision trees are supervised learning methods that predict the value of a target variable based on several input variables. They can be easily visualized and are intuitive to understand. Interior nodes of the tree split data tuples using a threshold value for one of the input variables, and each leaf node represents a value of the target variable.
MADlib 1.7 has a completely new and improved implementation that runs up to 40 times faster than the previous version. Additional features include pruning methods, surrogate variables for NULL handling, cross validation, tuning parameters and visualization of the trained tree.
Let’s look at an example using the Car Evaluation Data Set from the UC Irvine Machine Learning Repository. This data set describes the “acceptability” or selection of a car based on the following input variables: purchase price, maintenance cost, number of doors, passenger capacity, trunk size, and safety rating.
This data set in Greenplum format is available for download here.
The SQL statement to train the decision tree is:
SELECT * FROM madlib.tree_train(
    'car_eval',  -- Data table
    'output',    -- Table to store model
    'id',        -- ID column name
    'class',     -- Column to predict
    '*',         -- Use all features
    NULL,        -- Features to exclude (none in this case)
    'gini'       -- Classification impurity function
);
The resulting tree can be exported in DOT format, which is a plain text graph description language that is both human and machine readable.
-- Export tree to dot file
\pset format unaligned
\pset tuples_only
\o dt_output.dot
SELECT madlib.tree_display('output');
\o
A number of programs can be used to render DOT graphs. For example, with Graphviz installed, the Unix shell command is:
dot -Tpdf dt_output.dot -o dt_output.pdf
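Once trained, the tree can also be applied to new data with MADlib's tree_predict() function. The output table name below is our own choice, and scoring the training table itself is just for illustration:

```sql
SELECT madlib.tree_predict(
    'output',       -- Model table produced by tree_train
    'car_eval',     -- Table with rows to score
    'predictions',  -- Table to store the predictions
    'response'      -- Return predicted class labels ('prob' for probabilities)
);

-- The predicted label appears in a column named estimated_<response>:
SELECT id, estimated_class FROM predictions LIMIT 5;
```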
Random Forest
Although a single decision tree is intuitive to understand, it may overfit the data. One way to mitigate this problem is by building an ensemble of classifiers, each of which produces a tree modeled on some combination of the input data. The results of these models are then combined to yield a single prediction, which is highly accurate at the expense of some loss in interpretation.
MADlib 1.7 has a completely new and improved implementation of random forest that includes variable importance metrics and ability to explore each tree in the forest independently.
For the same car evaluation data set we used above, here is the SQL statement to train the random forest:
SELECT * FROM madlib.forest_train(
    'car_eval',   -- Data table
    'rf_output',  -- Table to store model
    'id',         -- ID column name
    'class',      -- Column to predict
    '*',          -- Use all features
    '',           -- Features to exclude (none in this case)
    '',           -- Grouping columns (no grouping in this case)
    10,           -- Number of trees to train
    3,            -- Use 3 randomly selected features for each node
    TRUE,         -- Compute variable importance
    1,            -- Use a single permutation for variable importance
    4             -- Maximum depth for each tree
);
Let’s say we are interested in understanding variable importance, that is, which variables contribute the most and the least to prediction in the training data. The SQL is:
SELECT unnest(regexp_split_to_array(cat_features, ',')) AS variable,
       unnest(cat_var_importance) AS importance
FROM rf_output_group, rf_output_summary;
which produces the following output:
 variable |     importance
----------+---------------------
 maint    | 0.0272744901879319
 persons  | 0.088661843494196
 lug_boot | 0.00573215979836386
 safety   | 0.0826413222054395
 doors    | 0
 buying   | 0.0384399018694643
From this table, it appears that passenger capacity (persons) and safety rating are the most important explanatory variables for acceptability, and the number of doors is the least important (with zero importance, it is effectively irrelevant).
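Predictions from the forest work the same way as for a single tree, via MADlib's forest_predict() function. The output table name below is our own choice:

```sql
SELECT madlib.forest_predict(
    'rf_output',       -- Model table produced by forest_train
    'car_eval',        -- Table with rows to score
    'rf_predictions',  -- Table to store the predictions
    'response'         -- Return predicted class labels ('prob' for probabilities)
);

SELECT id, estimated_class FROM rf_predictions LIMIT 5;
```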
Each of the new algorithms in MADlib 1.7 is supported by PivotalR for users who prefer an R interface rather than SQL. PivotalR combines the usability of R with the performance and scalability benefits of in-database/in-Hadoop® computation.
Also, each model can be exported in Predictive Model Markup Language (PMML) format. PMML is an XML-based file format that gives applications a standard way to describe and exchange predictive models.
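Assuming MADlib's pmml() function accepts the model table name, the export is a single call; here the psql \o redirection writes the XML to a file (the filename is our own choice):

```sql
-- Export a trained model to PMML; 'output' is the decision tree
-- model table from the earlier tree_train example
\o car_tree.pmml
SELECT madlib.pmml('output');
\o
```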
- Read the Release Notes, download the source, or join the forums
- Find out more about PivotalR, Pivotal Greenplum Database, or Pivotal HAWQ
- Read other articles from Pivotal Data Scientists
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author
More content by Frank McQuillan