Thank you to Rahul Iyer from Pivotal for contributing to the software and to this article.
Path functions are a powerful capability in the data science toolkit, and they are now available in the newest release of the open source Apache MADlib (incubating) library.
For example, path functions can be used to reason over website, shopping cart, and customer support clickstreams to identify the golden paths to purchase, multi-channel promotion effectiveness, or customer churn. In addition, they can be used in predictive analytics use cases, like analyzing millions of sensor logs from cars or other machines to identify common patterns in part failure. These scenarios can also improve safety and substantially lower operating costs. The list of potential applications includes scenarios around fraud, security, healthcare, advertising, travel, and more.
Apache MADlib (incubating) is a SQL-based, open source library for scalable in-database analytics that supports Greenplum Database, Pivotal HDB and PostgreSQL. The library offers data scientists numerous distributed implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
MADlib 1.9 adds the following capabilities:
- Path functions—perform regular pattern matching over a sequence of rows and extract useful information about the pattern matches
- Support vector machines—completely new implementation for classification and regression including support for non-linear kernels
- Advanced matrix operations—more than 15 new methods including several decomposition operations
- Stemmer—determine roots of words in a vocabulary
- New statistics functions—covariance matrix and proportion of variance for principal component analysis
- Support for Pivotal HDB 2.0 (powered by Apache HAWQ)—new release of the industry-leading Hadoop native SQL platform
Introduction To Path Functions
The goal of the MADlib path function is to perform regular pattern matching over a sequence of rows and extract useful information about the pattern matches. The useful information could be a simple count of matches or something more involved like aggregations or window functions.
Symbols are used to identify particular rows of interest. Then standard PostgreSQL pattern matching uses these symbols to identify patterns across the rows of interest. This is similar in concept to regular expressions which identify patterns within strings of text to extract fragments. However, instead of searching text, we are searching through a table in the database.
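The row-matching idea can be sketched in plain Python. In this toy example (the session data and single-character symbol encoding are illustrative assumptions, not MADlib internals), each row of a session is encoded as one character so that an ordinary regular expression can match patterns across rows rather than within text:

```python
import re

# One hypothetical session's page views, ordered by time.
pages = ["LANDING", "WINE", "HELP", "WINE", "CHECKOUT"]

# Map each symbol to one character so the row sequence becomes a string;
# pages with no defined symbol fall through to 'o' (other).
encode = {"LANDING": "l", "WINE": "w", "BEER": "e", "CHECKOUT": "b"}
encoded = "".join(encode.get(p, "o") for p in pages)  # 'lwowb'

# A symbol pattern: a wine page view directly followed by a checkout.
print(bool(re.search(r"wb", encoded)))  # True
```

The same regex machinery that extracts fragments from text here identifies a sequence of rows of interest.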
For example, a symbol can be defined for purchase events by online shoppers. Preceding events that led to the purchase can be identified and analyzed, in order to find the common actions that resulted in the purchase. Or conversely, to discover actions that resulted in an exit without a purchase having been made.
Here are the main steps to use path functions:
- Partition input rows.
- Order the partitions.
- Define symbols to match rows of interest.
- Define regular expression of symbols and operators to define patterns to match in the ordered partitions.
- Define an aggregate function to compute for each pattern match.
- If desired, output the pattern matches for inspection or to operate on with subsequent queries.
The function syntax is as follows:
```sql
madlib.path(
    source_table,    -- Name of input table
    output_table,    -- Table name to store path results
    partition_expr,  -- How to partition input table
    order_expr,      -- How to order partitions
    symbol,          -- Symbols to identify rows of interest
    pattern,         -- PostgreSQL pattern matching expressions
    aggregate_func,  -- Aggregate to run on pattern matches
    persist_rows     -- Boolean to write out pattern matches
)
```
Let’s look at an example.
e-Commerce Example
This data set describes shopper behavior on a notional web site that sells beer and wine. A beacon fires an event to a log file when the shopper visits different pages on the site—landing page, beer selection page, wine selection page, and checkout. There are other pages on the site, like help pages and account settings, and these show up in the logs as well. Let’s assume that the log has already been sessionized.
Here are the first few rows of the log file sorted by time:
In this subset of the data, there are two checkout events—user 100821 makes a purchase of $39 and user 101121 makes a purchase of $15.
Identifying Quick Purchasers
Now, it is time to identify specific sessions associated with quick purchases. We are interested in sessions where an order is placed within 4 pages of entering the shopping site via the landing page, because we want to understand revenue from quick purchasers. If we define the symbol 'land' for a landing page event and 'buy' for a checkout event, we can represent our pattern of interest with the regular expression:
```
'(land)[^(land)(buy)]{0,2}(buy)'
```
In other words, the pattern is a visit to the landing page, followed by 0 to 2 visits to non-entry, non-checkout pages, followed by a purchase. The path function call is as follows:
```sql
SELECT madlib.path(
    'eventlog',             -- Name of input table
    'path_output',          -- Table name to store path results
    'session_id',           -- Partition input table by session
    'event_timestamp ASC',  -- Order partitions by event time
    $$ land:=page='LANDING',
       wine:=page='WINE',
       beer:=page='BEER',
       buy:=page='CHECKOUT',
       other:=page<>'LANDING' AND page<>'WINE' AND page<>'BEER' AND page<>'CHECKOUT'
    $$,                     -- Symbols for page types
    '(land)[^(land)(buy)]{0,2}(buy)',  -- Purchase within 4 pages
    'sum(revenue) as checkout_rev',    -- Sum revenue by checkout
    TRUE                    -- Persist matches
);

SELECT * FROM path_output ORDER BY session_id, match_id;
```
This produces the following result:
There are three resulting cases of quick purchases. In the case of session 102, there were actually two quick purchases within the same session. Let’s confirm by viewing the pattern matches:
```sql
SELECT * FROM path_output_tuples
ORDER BY session_id ASC, event_timestamp ASC;
```
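The matching semantics of the quick-purchase pattern can be sketched in plain Python. The sessions below are hypothetical, and the single-character encoding is an illustrative stand-in for how symbols let a regex operate over rows:

```python
import re

# Hypothetical sessions: page views ordered by time.
sessions = {
    101: ["LANDING", "WINE", "CHECKOUT"],                  # buys within 4 pages
    102: ["LANDING", "HELP", "HELP", "WINE", "CHECKOUT"],  # too slow: 3 pages in between
    103: ["LANDING", "BEER"],                              # no purchase
}

# Encode each page as a symbol character: l = land, b = buy, o = everything else.
def encode(pages):
    return "".join({"LANDING": "l", "CHECKOUT": "b"}.get(p, "o") for p in pages)

# '(land)[^(land)(buy)]{0,2}(buy)' becomes l[^lb]{0,2}b on the encoded string.
quick = re.compile(r"l[^lb]{0,2}b")
matches = {sid: bool(quick.search(encode(p))) for sid, p in sessions.items()}
print(matches)  # {101: True, 102: False, 103: False}
```

Session 102 does purchase, but only after 3 intermediate pages, so the `{0,2}` quantifier correctly rejects it.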
We may also want to use a window function instead of an aggregate; window functions can be written over the output tuples. Continuing the previous example, let's say we want to compute the average revenue for checkouts made within 4 pages of entering the shopping site via the landing page:
```sql
SELECT DATE(event_timestamp),
       user_id,
       session_id,
       revenue,
       avg(revenue) OVER (PARTITION BY DATE(event_timestamp)) AS avg_checkout_rev
FROM path_output_tuples
WHERE page='CHECKOUT'
ORDER BY user_id, session_id;
```
This produces the following result:
Golden Path Analysis
So far, we have identified quick purchasers through aggregates and window functions. Now, we would like to find the most successful shopper paths through the site by doing a golden path analysis. These are the paths we want to understand and build upon, because they are the paths that increase our revenue. Since our data set is notional, we decide this means the most frequently viewed page just before a checkout is made:
```sql
SELECT madlib.path(
    'eventlog',             -- Name of input table
    'path_output',          -- Table name to store path results
    'session_id',           -- Partition input table by session
    'event_timestamp ASC',  -- Order partitions by event time
    $$ land:=page='LANDING',
       wine:=page='WINE',
       beer:=page='BEER',
       buy:=page='CHECKOUT',
       other:=page<>'LANDING' AND page<>'WINE' AND page<>'BEER' AND page<>'CHECKOUT'
    $$,                     -- Symbols for page types
    '[^(buy)](buy)',        -- Pattern to match
    'array_agg(page ORDER BY session_id ASC, event_timestamp ASC) as page_path',  -- Build array with shopper paths
    TRUE                    -- Persist matches
);

SELECT count(*), page_path
FROM (SELECT * FROM path_output) q
GROUP BY page_path
ORDER BY count(*) DESC
LIMIT 10;
```
Note the use of array_agg in the aggregate expression to arrange the pattern matches for subsequent queries. This produces the following result:
In this small data set, the wine page is viewed more frequently than the beer page prior to checkout. We could perform a similar query to identify common unsuccessful paths through the site in order to understand how they could be improved.
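The counting step of this golden path analysis can be sketched in plain Python over hypothetical sessions; the equivalent of the `'[^(buy)](buy)'` pattern is simply "the page viewed just before each checkout":

```python
from collections import Counter

# Hypothetical sessions: page views ordered by time.
sessions = [
    ["LANDING", "WINE", "CHECKOUT"],
    ["LANDING", "BEER", "WINE", "CHECKOUT"],
    ["LANDING", "BEER", "CHECKOUT"],
    ["LANDING", "HELP"],  # exit without a purchase
]

# For every checkout, record the page viewed immediately before it,
# then rank pages by how often they precede a purchase.
before_checkout = Counter(
    pages[i - 1]
    for pages in sessions
    for i in range(1, len(pages))
    if pages[i] == "CHECKOUT"
)
print(before_checkout.most_common())  # [('WINE', 2), ('BEER', 1)]
```

Flipping the condition to sessions that end without a checkout would give the analogous ranking for unsuccessful paths.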
Learning More:
- Read the MADlib 1.9 release notes, download the source code or binaries, or join the user forum
- Read the MADlib path functions user documentation
- Download the IPython notebook for the examples in this post.
- Find out more about Greenplum Database or Apache HAWQ/Pivotal HDB
- Read other articles from Pivotal Data Scientists
About the Author