Path Functions in Apache MADlib

May 11, 2016 Frank McQuillan

 

35942-MADlib-path-functions-sfeaturedThank you to Rahul Iyer from Pivotal for contributing to the software and to this article.

Path functions are a powerful capability in the data science toolkit, and they are now available in the newest release of the open source Apache MADlib (incubating) library.

For example, path functions can be used to reason over website, shopping cart, and customer support clickstreams to identify the golden paths to purchase, multi-channel promotion effectiveness, or customer churn. In addition, they can be used in predictive analytics use cases, like analyzing millions of sensor logs from cars or other machines to identify common patterns in part failure. These scenarios can also improve safety and substantially lower operating costs. The list of potential applications includes scenarios around fraud, security, healthcare, advertising, travel, and more.

Apache MADlib (incubating) is a SQL-based, open source library for scalable in-database analytics that supports Greenplum Database, Pivotal HDB and PostgreSQL. The library offers data scientists numerous distributed implementations of mathematical, statistical and machine learning methods for structured and unstructured data.

MADlib 1.9 adds the following capabilities:

  • Path functions—perform regular pattern matching over a sequence of rows and extract useful information about the pattern matches
  • Support vector machines—completely new implementation for classification and regression including support for non-linear kernels
  • Advanced matrix operations—more than 15 new methods including several decomposition operations
  • Stemmer—determine roots of words in a vocabulary
  • New statistics functions—covariance matrix and proportion of variance for principal component analysis
  • Support for Pivotal HDB 2.0 (powered by Apache HAWQ)—new release of industry leading Hadoop native SQL platform

Introduction To Path Functions

The goal of the MADlib path function is to perform regular pattern matching over a sequence of rows and extract useful information about the pattern matches. The useful information could be a simple count of matches or something more involved like aggregations or window functions.

Symbols are used to identify particular rows of interest. Then standard PostgreSQL pattern matching uses these symbols to identify patterns across the rows of interest. This is similar in concept to regular expressions which identify patterns within strings of text to extract fragments. However, instead of searching text, we are searching through a table in the database.

For example, a symbol can be defined for purchase events by online shoppers. Preceding events that led to the purchase can be identified and analyzed, in order to find the common actions that resulted in the purchase. Or conversely, to discover actions that resulted in an exit without a purchase having been made.

Here are the main steps to use path functions:

  1. Partition input rows.
  2. Order the partitions.
  3. Define symbols to match rows of interest.
  4. Define regular expression of symbols and operators to define patterns to match in the ordered partitions.
  5. Define an aggregate function to compute for each pattern match.
  6. If desired, output the pattern matches for inspection or to operate on with subsequent queries.

The function syntax is as follows:

madlib.path(
    source_table,	-- Name of input table
    output_table,	-- Table name to store path results
    partition_expr,	-- How to partition input table
    order_expr,		-- How to order partitions
    symbol,		-- Symbols to identify rows of interest
    pattern,		-- PostgreSQL pattern matching expressions
    aggregate_func,	-- Aggregate to run on pattern matches
    persist_rows		-- Boolean to write out pattern matches
)

Let’s look at an example.

e-Commerce Example

This data set describes shopper behavior on a notional web site that sells beer and wine. A beacon fires an event to a log file when the shopper visits different pages on the site—landing page, beer selection page, wine selection page, and checkout. There are other pages on the site, like help pages and account settings, and these show up in the logs as well. Let’s assume that the log has already been sessionized.

Here are the first few rows of the log file sorted by time:

image01

In this subset of the data, there are two checkout events—user 100821 makes a purchase of $39 and user 101121 makes a purchase of $15.

Identifying Quick Purchasers

Now, it is time to identify specific sessions associated with quick purchases. We are interested in sessions with an order placed within 4 pages of entering the shopping site via the landing page. This because we want to understand revenue from quick purchasers. If we define the symbols ‘land’ for a landing page event and ‘buy’ for a checkout event, we can represent our pattern of interest by the regular expression:

'(land)[^(land)(buy)]{0,2}(buy)'

In other words, the pattern is a visit to a landing page followed by 0 to 2 visits on non-entry, non-checkout pages and then followed by a purchase. The path function is as follows:

SELECT madlib.path(
     'eventlog',                -- Name of input table
     'path_output',             -- Table name to store path results
     'session_id',              -- Partition input table by session
     'event_timestamp ASC',     -- Order partitions by event time
     $$ land:=page='LANDING',
        wine:=page='WINE',
        beer:=page='BEER',
        buy:=page='CHECKOUT',
        other:=page<>'LANDING' AND page<>'WINE' AND page<>'BEER' AND  page<>'CHECKOUT'
$$,    			     -- Symbols for page types
      '(land)[^(land)(buy)]{0,2}(buy)', -- Purchase within 4 pages
     'sum(revenue) as checkout_rev',    -- Sum revenue by checkout
     TRUE                       -- Persist matches
     );

SELECT * FROM path_output ORDER BY session_id, match_id;

This produces the following result:

image03

There are three resulting cases of quick purchases. In the case of session 102, there were actually two quick purchases within the same session. Let’s confirm by viewing the pattern matches:

SELECT * FROM path_output_tuples ORDER BY session_id ASC, event_timestamp ASC;
shopper behavior

As well, we may want to use a window function instead of an aggregate. You can write window functions on the output tuples. Continuing the previous example, let’s say we want to compute average revenue for checkouts within 4 pages of entering the shopping site via the landing page:

SELECT DATE(event_timestamp), user_id, session_id, revenue,
    avg(revenue) OVER (PARTITION BY DATE(event_timestamp)) as avg_checkout_rev
    FROM path_output_tuples
    WHERE page='CHECKOUT'
    ORDER BY user_id, session_id;

This produces the following result:
image02

Golden Path Analysis

So, we have identified quick purchasers through aggregates and windows functions. Now, we would like to find the most successful shopper paths through the site by doing a golden path analysis. These are the paths we want to understand and build upon—these are the paths to increase our revenue. Since our data set is notional, we decide this means the most frequently viewed page just before a checkout is made:

SELECT madlib.path(
     'eventlog',                -- Name of input table
     'path_output',             -- Table name to store path results
     'session_id',              -- Partition input table by session
     'event_timestamp ASC',     -- Order partitions by event time
     $$ land:=page='LANDING',
        wine:=page='WINE',
        beer:=page='BEER',
        buy:=page='CHECKOUT',
        other:=page<>'LANDING' AND page<>'WINE' AND 
               page<>'BEER' AND page<>'CHECKOUT' 
$$,		          	   -- Symbols for page types
      '[^(buy)](buy)',          -- Pattern to match
     'array_agg(page ORDER BY session_id ASC, event_timestamp ASC) as page_path',    			   -- Build array with shopper paths
     TRUE                       -- Persist matches
     );
     
SELECT count(*), page_path from
    (SELECT * FROM path_output) q
GROUP BY page_path
ORDER BY count(*) DESC
LIMIT 10;

Note the use of array_agg in the aggregate expression to arrange the pattern matches for subsequent query. This produces the following result:

madlib-path-last-table

In this small data set, the wine page is viewed more frequently than the beer page prior to checkout. We could perform a similar query to identify common unsuccessful paths through the site in order to understand how they could be improved.

Learning More:

 

About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank has worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Masters degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.

Previous
Pivotal Perspectives—Pivotal Cloud Foundry 1.4: A Potted Tour
Pivotal Perspectives—Pivotal Cloud Foundry 1.4: A Potted Tour

A platform is only useful to developers if it is kept up to date. This week, host Simon Elisha updates us o...

Next
Deploying Microservice Architectures with VMware Tanzu Application Service
Deploying Microservice Architectures with VMware Tanzu Application Service

Matt Stine, a Principal Software Engineer at Pivotal, gave an excellent, fast-paced talk at Cloud Foundry S...

×

Subscribe to our Newsletter

!
Thank you!
Error - something went wrong!