Data Science Deep Dive: Applying Machine Learning To Customer Churn

July 18, 2016 Esther Vasiete

Joint work with Gautam Muralidhar.

One of the most important business metrics for executives is churn rate—the rate at which your customers stop doing business with you. Today, data driven companies use data science to effectively predict which customers are likely to churn. Such programs allow companies to proactively protect revenue.

The Pivotal data science team has worked with many companies on churn prediction problems and has built up significant expertise in the area. Previous blog posts (part 1 | part 2) discussed the problem of churn prediction in the financial industry. Here, we will discuss how these types of problems are solved, using an example of our work with one of the world’s largest networking and communications companies. Below, we will illustrate how different sources of data are brought together, how features are extracted, and how Pivotal Big Data Suite is used to build churn models which yield actionable insights and help protect revenue.

Typical Data Sources In Networking And Communications

Networking and communication companies typically have business level data, usage logs, and call center/support tickets, among other data assets. This data is generated from their consumer and business customer interactions, and these data sets vary in terms of volume and behavior. The combination of these data sources leads to a classic mix of structured and unstructured data. At Pivotal, we bring this data together to fully exploit the truly parallel and scalable environment offered by Pivotal Greenplum and Pivotal HDB, which is based on Apache HAWQ (incubating).

Business data refers to the customer-level information, such as account and contract details, and includes subscription products, special offers, payment types, license types, etc. Their usage data includes detailed, atomic, user-level log information for every product and service—this can easily escalate into billions of rows in a database. For instance, video conference users generate data like start and end times, conference participants, join and exit times, and their access devices, e.g. mobile, desktop, etc. There is also information about their operating systems, whether video was on/off for each participant, whether a particular feature was used during the call, and the amount paid for that particular call, etc.

Call center logs are usually very different in nature from business and usage data. These come mostly in the form of unstructured data—opened support tickets, notes on customer inquiries, and complaints transcribed into text. Such data needs to be processed using text analytics techniques and could be used in addition to structured data, which contains useful information about customer engagement and satisfaction.

Feature Extraction At Scale Through Data Parallelism

Customers’ usage data needs to be processed and summarized into a set of predictor variables (commonly referred as features) These features are fed into a machine learning framework. One way to do this is by aggregating usage data into daily, weekly or monthly intervals at a customer (i.e. organization) level. Such aggregation can be easily done at scale with Pivotal data platforms—they allow the data to be distributed across nodes and operated on in parallel. Whereas slower systems can take hours or days to allow data scientists to iterate, these platforms take seconds and minutes, making data science work much more productive.

Below, we show a code snapshot that aggregates domestic call usage on a monthly basis for each organization:

CREATE VIEW aggregated_usage_data AS
SELECT
   customer_id,
   date,
   sum(case when call_id is not null then 1 else 0 end) as num_calls
FROM(
  
  SELECT 
     customer_id, 
     date_trunc('month', call_start) as date, 
     call_id 
  FROM 
     usage_data 
  UNION ALL 
  SELECT 
     customer_id, 
     generate_series(
          date_trunc('month', min(call_start)),   --  start date
          date_trunc('month', max(call_start)),   --  end date
          '1 month’                               --  interval
     )  as date,
     null 
  FROM 
     usage_data
  GROUP BY customer_id
) q
GROUP BY 1,2

This query creates a time series signal and explains the variation of call throughput on a monthly basis. Importantly, the most inner query is a combination of two queries using UNION ALL. The first query selects the data needed to aggregate call usage per customer on a monthly basis, and the second query generates a series of monthly timestamps per customer, ranging from the first available call until the last one. This union is needed to guarantee that months without usage (if any) will still appear in the result table if no calls occurred during the month. This is instrumental to the entire analysis. We need to keep those non-usage data points as a discrete signal for further signal processing and feature extraction.

Let us compare, in Figure 1, the resulting time series for two different customers. Organization #1 has been a customer for 6 years and usage increased despite some remarkably low peaks, which could be due to seasonality. On the other hand, organization #2 has only been a customer since 2013 and its usage has been rather inconsistent since then. There are actually several months with no usage at all. Independent of the number of data points available (i.e. months in service), we can explain and summarize these signals with a set of engineered features and let a machine learning model learn what usage patterns are most likely to belong to customers who will churn.

Figure 1: Aggregated monthly usage over time for two different organizations. Organization #1 has used telephony services for six years and presents an increasing trend despite some low peaks due to seasonality (consistent low peaks at the end/beginning of the year). Organization #2 has had rather inconsistent usage over three years with several months of no usage at all.

One of the most important aspects of any predictive modeling framework is summarizing the aggregated data into features. We start by extracting several summary statistics such as the mean, median, and standard deviation, etc. In addition to that, we also perform some transformations on the data such as computing a difference vector (x[n] - x[n-1]) to look for the rate of usage changes across months. All of these operations can be performed in PostgreSQL. Window functions in PostgreSQL are really useful to perform time series analysis in-database, but the queries become rather complex and challenging as you want to perform more sophisticated signal processing techniques, for example, transformation of a time-series signal into its power spectral density using Fourier analysis to unveil periodicity patterns in customers’ usage.

By using procedural language extensions PL/X derived from PostgreSQL, we can take advantage of any open source libraries for signal processing and statistical analysis then perform all of these operations in-database. For instance, one such library is Python’s statsmodels library, which contains a ‘seasonal_decompose’ function and extracts trend and seasonality from a signal (see Fig. 2). Another personal favorite is the SciPy library, which contains a handful of Fourier transforms and spectral density estimation methods written in Python.

The following PL/Python UDF contains starter code to perform some of the aforementioned transformations and engineer features from the original and derived signals. Here, the original signal is first normalized by its mean to compensate for different account sizes (e.g. some organizations have thousands of users, others have tens, and consumers are single individuals) and different months in service.

CREATE FUNCTION temporal_feature_extraction (x float[], start_date string)
RETURNS float[]
AS $$

   # Import Python libraries
   import statsmodels.api as sm
   import pandas as pd
   import numpy as np

  # Normalize signal 
   x_norm = x/np.mean(x)

   # Initialize feature vector
   feat = [] 

   # Extract features from original signal x[n]
   feat.append(np.std(x_norm))
   feat.append(np.median(x_norm))
   feat.append(np.max(x_norm))

   # Extract features from difference signal 
   diff = x_norm[n] - x_norm[n-1]
   feat.append(np.mean(diff))
   feat.append(np.max(np.abs(diff)))
   	
   # Seasonal decomposition with statsmodels
   dates = pd.date_range(start_date, periods=len(x), freq='M')
   df = pd.Series(x_norm, index=rng)
   result = sm.tsa.seasonal_decompose(df)
   trend = result.trend

   # Extract features from trend 
   feat.append(np.std(trend))
   feat.append((np.sum(trend)/np.mean(trend))
   	
   # Return feature vector
   return feat
$$ LANGUAGE plpythonu

There are two important aspects of our approach here that work together to make our data science efforts easier and faster. First, we are using the procedural extensions from PostgreSQL. Open source Python and R libraries can be packaged as UDFs without much effort. Two, we use either Pivotal Greenplum or Pivotal HDB, based on Apache HAWQ (incubating). Both of these have a massively parallel processing architecture, which we can take advantage of because the UDF is setup to process the data and extract the features independently per customer (the GROUP BY customer_id is key here). By grouping by customer_id, the UDF runs in parallel in different segments, speeding up processing time dramatically. So, we can scale and achieve extremely fast throughput, which allows us to iterate on our code much faster.

See a sample call to the temporal_feature_extraction PL/Python function below. The GROUP BY customer_id clause is responsible for aggregating the data and invoking the feature extraction UDF per each customer. This will inherently occur in parallel in each of the nodes of a MPP database such as Pivotal Greenplum.

SELECT
   customer_id, 
   temporal_feature_extraction(
            array_agg(num_calls order by date),  -- time-series signal
            min(date)                            -- start date
    ) AS feature_vector
FROM aggregated_usage_data 
GROUP BY customer_id

Figure 2: Visualization of trend and seasonal signals extracted from a monthly (normalized) aggregated usage signal in Organization #1 and using the open-source statsmodels Python library. Both seasonality and trend can be used to better understand usage behavior over time (e.g. low usage peaks at the end of year might not be indicative of churn but rather holiday season).

We have covered some examples of usage-related features with a temporal component. These are just one type of feature, and, as mentioned before other feature examples include average number of participants in a call, percentage of international calls, use of a particular special offer, device and operating system of customer, number of tickets/complains filed, etc.

Scoring Customers, Predicting Churn, and Explaining Why

In a subscription-based business model (like our case), customers might have the intention to churn days or months before the renewal date. Since they have prepaid the services until the end of the subscription period, they are technically still “active” subscribers. Because of this behavior, we look at churn as a delayed indicator of the customer’s intentions to churn. Hence, it is a suitable idea to define churn as a binary classification problem—a customer churns after the termination of their subscription or not.

At the end of the day, our churn prediction model must allow the company to take action and prevent churn. With that in mind, we used a regularized logistic regression approach. This allows us to establish a relationship between the probability of churn, p, and a group of features which can be interpreted and organized by their impact. In mathematical terms, this relationship is defined as:

where β₀, β₁,… β_n represent the model coefficients or feature weights that have been learned from a large number of training examples (i.e. historical data of customers that churned in the past), and χ₀, χ₁,… χ_n represent the feature values from each instance in the training set.

The feature weights are comparable because χ₀, χ₁,… χ_nhave been standardized prior to fitting the model. By looking at the feature weights, we can provide feature interpretability. For instance, let’s assume a very simple model with just the three features, as shown in Table 1. First, the percentage of international calls contributes to churn, i.e., the more international calls a customer makes, the more likely they will cancel services. This is a positive indicator of churn. Second, it is unlikely a customer will cancel service when they regularly have a high number of participants per call. This is a negative indicator of churn. Finally, there are features with no impact, such as use of weekend special offers. It has a feature weight of zero.

Table 1: Feature names and feature weights of a simple logistic regression model with three features. Feature weights are comparable because the feature values have been previously standardized.

In addition to providing feature interpretability at a model level, we can also obtain a customer churn score breakdown by breaking apart β₀ + β₁χ₁ + β₂χ₂… β_nχ_n. While, the overall number increases as the probability of churn increases, we probably want to know why. By looking at the sign and magnitude of each of the products, we can provide a customized idea of why a particular customer is going to churn based on the model. Then, it will become easier to take a tailored strategy to maintain a customer that has a high probability of churn.

Figure 3: Churn score breakdown example for a particular customer showing the top three churn factors—decreasing trend in usage time series signal, low average call duration, and having a family subscription offer. It also shows the top three non-churn factors—consistent usage over time, having a weekend subscription offer and the activation of a video feature. Importantly, a different customer might present a different churn score breakdown.

Putting The Insight Into Action

Now that we have sourced the data, extracted features, scored customers to predict churn, and identified the factors which contribute to churn, we need to use this information to impact business operations and revenue. Of course, we could use a one-time analysis to reach out to customers proactively, learn more about them, adjust our capabilities or offers, and reduce churn. But, to stay competitive in this day and age, companies need to use machine learning as part of an an ongoing process because products, services, markets, and user behaviors change over time.

At Pivotal, most of our customers begin to take the intelligence from machine learning models and embed them within applications. For example, the machine learning models developed in Pivotal Greenplum and Pivotal HDB, based on Apache HAWQ (incubating), can be easily operationalized into scoring as a service apps on Pivotal Cloud Foundry and CRM systems. We also provide open source code examples of this.

In such a use case, customer call duration is streamed into an application. The application checks to see if the customer has been scored in the past week. If it has not, a scoring service is run, and the output shows the churn score and the breakdown of reasons why. This data can then be inserted into a CRM app as an activity for a customer success team to act on. This allows the customer success team to come up with a customized strategy and regain the customer before the subscription ends.

Learning More:

Read more blog posts by Pivotal Data Scientists
Find out more about Pivotal Big Data Suite, Greenplum, Apache HAWQ (incubating), and Pivotal Cloud Foundry.

About the Author

Biography

Data Lakes: The Lure of the Cloud

Data Lakes Look to the CloudsYou can be excused for conflating the concept of a data lake with the set of o...

How To Deal With Class Imbalance And Machine Learning On Big Data

As a member of Pivotal Data Science Labs, Greg Tam penned this article as an exploration in class imbalance...

Data Science Deep Dive: Applying Machine Learning To Customer Churn

Typical Data Sources In Networking And Communications

Feature Extraction At Scale Through Data Parallelism

Scoring Customers, Predicting Churn, and Explaining Why

Putting The Insight Into Action

Learning More:

About the Author

Previous

Next

Related content in this Stream

Part five in a five-part series, this webcast will be a demonstration of the integration of Apache Zeppelin and Pivotal HDB. Apache Zeppelin is a web-based notebook that enables interactive data...

Data Lakes Look to the CloudsYou can be excused for conflating the concept of a data lake with the set of open source technologies collectively called Apache™ Hadoop®. Since Pentaho CTO James...

As a member of Pivotal Data Science Labs, Greg Tam penned this article as an exploration in class imbalance problems for machine learning. Pulling insights from real-world scenarios, Tam walks...

HAWQ Labs has been working with the HAWQ team in Palo Alto for the last while, seeding a Toolsmiths team. We TDD our bash, and TDD our docker-images: come hear C.J. Jameson and Ben Calegari show you

Part three in a five-part series, this webcast will be a demonstration of the integration of Hortonworks HDB and Apache Hadoop YARN. YARN provides the global resource management for HDB for...