Data science is an exciting and innovative field that attracts a lot of attention these days, in part because it can add great value to the enterprise. What gets less attention, however, is the time-consuming work of collecting and preparing data sets for input into predictive analytics algorithms. A New York Times article on this topic reported that data scientists can spend 50 to 80 percent of their time collecting and preparing data. A survey reported in Forbes found that data scientists find this to be a less interesting part of their work.
However, data transformation cannot be fully automated. Understanding the integrity and completeness of data is an essential activity which helps a data scientist make proper scoping decisions, and determine how to fill in the blanks when certain pieces of data are missing. April Song, a data scientist at Pivotal comments:
“Before we start applying machine learning algorithms on our data, we must first reshape and transform our data. Having tools to help wrangle the data enables us to be more agile in our approach to data science.”
Apache MADlib (incubating) is a SQL-based open source library for scalable in-database analytics that supports Greenplum Database/Pivotal Greenplum, Apache HAWQ (incubating)/Pivotal HDB and PostgreSQL. The library offers data scientists numerous distributed implementations of mathematical, statistical and machine learning methods, including many utilities for data transformation.
New utilities have been added in the recent MADlib 1.9.1 release, including:
- Pivot—data summarization tool that can do basic OLAP type operations
- Sessionization—time-oriented session reconstruction on a data set comprising a sequence of events
- Prediction metrics—set of metrics to evaluate the quality of predictions of a model
In this article we will take a closer look at each of these. Please also refer to this previous article on path functions in MADlib which describes a way to perform regular pattern matching over a sequence of rows and extracting useful information about the pattern matches, such as aggregations or window functions.
The goal of the MADlib pivot function is to provide a data summarization tool that can do basic OLAP type operations on data stored in one table and output the summarized data to a second table. There are many variations on how to pivot data, depending on the needs of a particular situation, but the common theme is reshaping the data into a convenient format for insight and subsequent operations.
The pivot syntax is as follows:
pivot( source_table, -- Name of input table output_table, -- Table to store pivoted data index, -- How to group (desired rows in output) pivot_cols, -- What to pivot pivot_values, -- Values to be summarized aggregate_func, -- Aggregate(s) to apply to values fill_value, -- How to fill NULLs resulting from pivot keep_null, -- Whether to pivot NULLs output_col_dictionary -- For managing long column names )
Some of the terminology above may look familiar to users of Pandas pivot tables, which informed the MADlib implementation. (Pandas is a popular Python data analysis toolkit.) Let’s look at a couple examples using this small table:
The simplest possible pivot is:
SELECT madlib.pivot( 'pivset', -- Name of input table 'pivout', -- Table to store pivoted data 'id', -- Index 'piv', -- Pivot 'val'); -- Values
The resulting pivot table is:
Note that the columns names are auto-generated and intended to be descriptive. For example, the column
valis the column containing values that were aggregated
- Aggregate function avg was used (default since no other aggregate function was specified)
- The number
10indicates the values in the column
Next let’s look at a slightly more complicated pivot:
SELECT madlib.pivot( 'pivset', -- Name of input table 'pivout', -- Table to store pivoted data 'id', -- Index 'piv', -- Pivot 'val, val2' -- 2 value columns 'val=avg, val2=sum', -- Different aggregates for each value '-999', -- Fill NULLs in pivot output with -999 TRUE, -- Pivot NULL values into columns TRUE); -- Create a column header dictionary
The resulting pivot table is:
Because we opted to create a dictionary, the column headers are defined in a separate table:
A dictionary can improve readability and programmatic operation on the pivoted table. Also, when auto-generated column headers grow to exceed the PostgreSQL limit of 63 characters, a dictionary table is automatically created to avoid potential side effects from truncation.
For more information on other ways to pivot data, please refer to the pivot user documentation.
The MADlib sessionize function performs time-oriented session reconstruction on a data set comprising a sequence of events. A defined period of inactivity indicates the end of one session and beginning of the next session. Sessions can be useful in many domains including web analytics, network security, manufacturing, finance, and operational analytics.
The sessionize syntax is as follows:
sessionize( source_table, -- Name of input table output_table, -- Table name for sessionized output partition_expr, -- How to partition input table time_stamp, -- Time stamp column for session calculation max_time, -- Maximum delta time to define session output_cols, -- Output format options create_view -- Create a view or materialize a table
Let’s look an example. The simple data set below describes shopper behavior on a notional web site that sells beer and wine. A beacon fires an event to a log file when the shopper visits different pages on the site—landing page, beer selection page, wine selection page, and checkout. There are other pages on the site, like help pages and account settings, and these show up in the logs as well.
Here are the first few rows of the log file, sorted by time:
The sessionize function call is:
SELECT madlib.sessionize( 'eventlog', -- Name of input table 'sessionize_output', -- Table to store sessionize results 'user_id', -- Partition by user id 'event_timestamp', -- Time stamp column '0:30:0' -- Time out of 30 min defines a session );
The result is:
User 100821 had two sessions on this day and user 202201 had three sessions. For more information on sessionization, please refer to the sessionize user documentation.
This utility provides a set of metrics to evaluate the quality of model predictions. A model may become less accurate over time if the data changes, or if there is a fundamental shift in the characteristics of the system being modeled. Data scientists use a variety of tools for model evaluation. The following prediction metrics have been added to MADlib:
- For continuous variables:
- Mean absolute error
- Mean absolute percentage error
- Mean percentage error
- Mean squared error
- Adjusted R-squared
- For binary classification:
- True positive rates (TPR) and false positive rates (FPR)
- Area under ROC curve
- For multi-class classifiers:
- Confusion matrix
For example, the area under an ROC curve is a common way to evaluate a binary classifier compared to a random coin toss:
SELECT madlib.area_under_roc ( 'test_set', -- Name of input table 'table_out', -- Output table 'pred', -- Predicted values 'obs'); -- Observations
The result below shows that the model is better than random (area of 0.5) and less than perfect (area of 1.0):
For more details, please refer to the prediction metrics user documentation.
- Read the MADlib 1.9.1 release notes, download the source code or binaries, or join the user forum
- Refer to the MADlib user documentation
- Try out the iPython notebooks for the examples in this article
- Find out more about Pivotal Greenplum/Greenplum Database, or Apache HAWQ/Pivotal HDB
- Read other articles from Pivotal Data Scientists
About the Author
Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank has worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Masters degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.More Content by Frank McQuillan