Joint work performed by Niels Kasch and Mariann Micsinai of Pivotal’s Data Science Labs.
Financial firms collect large volumes of data from all realms of our daily lives. These data assets are used to build predictive models for many purposes, such as understanding and predicting customer behavior. Insights from these models can be applied to areas such as customer acquisition and retention.
In this blog series, we will explain the important factors that enable banks in the retail finance and asset management industries to build, operationalize, and derive actionable insight from such models. First, we consider how the data scientist navigates a financial institution’s multiple definitions of “customer” and “churn” to construct the correct population for analysis. In the next blog post, we will continue examining churn prediction by looking at predictive and explanatory modeling tools and the resulting customer applications.
What constitutes churn? Defining the dependent variable
When building a predictive model, data scientists require a precise definition of the dependent variable, i.e., what should be predicted or explained. Deriving this definition requires tight collaboration with business experts such as portfolio managers, IT warehouse data owners, and subject matter experts. For retail finance customer retention models, the business often needs to clarify whether scenarios such as the following constitute churn:
- A customer closes her account, then opens another account with better conditions at the same institution. The net outward flow of assets is zero.
- A customer transfers 90% of her assets to another institution, but does not close her account. The net outward flow of assets is 90%.
- A company decides to change its 401(k) plan administrator. In this case, all 401(k) employee accounts are transferred to the competitor. The net outward flow of assets is 100%.
When drawing conclusions from the model, business stakeholders need to be aware of the assumptions built into the definition of the dependent variable.
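To make the role of these assumptions concrete, here is a minimal sketch of how the three scenarios above might be turned into a churn label. The `AssetSummary` fields, the 50% threshold, and the `count_involuntary` switch are all hypothetical choices made for illustration; in practice each of them would be agreed upon with the business.

```python
from dataclasses import dataclass


@dataclass
class AssetSummary:
    """Hypothetical per-customer summary over one observation window."""
    customer_id: str
    starting_assets: float    # assets held at the institution at window start
    external_outflow: float   # assets moved to OTHER institutions in the window
    customer_initiated: bool  # False if, e.g., an employer moved the 401(k) plan


def net_outflow_fraction(s: AssetSummary) -> float:
    """Share of starting assets that left the institution."""
    if s.starting_assets <= 0:
        return 0.0
    return s.external_outflow / s.starting_assets


def is_churn(s: AssetSummary, threshold: float = 0.5,
             count_involuntary: bool = False) -> bool:
    """Label churn under two explicit, business-owned assumptions:
    an outflow threshold and whether involuntary moves count."""
    if not s.customer_initiated and not count_involuntary:
        return False
    return net_outflow_fraction(s) >= threshold


# The three scenarios from the list above, with made-up balances.
scenarios = [
    AssetSummary("reopened_account", 10_000.0, 0.0, True),
    AssetSummary("moved_90_percent", 10_000.0, 9_000.0, True),
    AssetSummary("plan_admin_change", 10_000.0, 10_000.0, False),
]
labels = {s.customer_id: is_churn(s) for s in scenarios}
print(labels)
```

Under these particular assumptions only the second scenario is labeled as churn. Setting `count_involuntary=True` would also flag the plan administrator change, which shows how a single definitional choice changes the dependent variable.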
Which customers are part of the analysis? Defining the population
To arrive at the population of interest for a model, we employ the Data Waterfall approach. This approach narrows down the population from all conceptual customers to those customers that are of interest to the business, and for which it is feasible to develop a model. Our approach asks questions such as:
- What is the total universe of customers?
- Who is in the population of interest?
- How are they defined in the data?
- Who will the model be applied to?
The purpose of these questions is to understand whether the available data assets represent the entire customer base or only a subset of it. In practice, financial regulatory requirements often prevent two internal business units from sharing customer data. This can lead to an incomplete picture of the customer base. Understanding who is missing from the population, and why, informs how widely any resulting model can be applied.
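The narrowing described above can be sketched as a sequence of named filters that reports how many customers survive each step. This is only an illustration of the idea, not our production tooling: the record fields (`segment`, `shareable`, `months_of_history`) and the three filter steps are hypothetical.

```python
def data_waterfall(customers, steps):
    """Apply named filters in order and record the population size after each."""
    remaining = list(customers)
    report = [("all customers in the data", len(remaining))]
    for name, keep in steps:
        remaining = [c for c in remaining if keep(c)]
        report.append((name, len(remaining)))
    return remaining, report


# Hypothetical customer records.
customers = [
    {"id": 1, "segment": "retail", "shareable": True, "months_of_history": 36},
    {"id": 2, "segment": "retail", "shareable": False, "months_of_history": 48},
    {"id": 3, "segment": "corporate", "shareable": True, "months_of_history": 60},
    {"id": 4, "segment": "retail", "shareable": True, "months_of_history": 3},
]

# Hypothetical steps: each one answers one of the questions listed above.
steps = [
    ("in the population of interest (retail)", lambda c: c["segment"] == "retail"),
    ("data may be shared across business units", lambda c: c["shareable"]),
    ("enough history to model (12+ months)", lambda c: c["months_of_history"] >= 12),
]

population, report = data_waterfall(customers, steps)
for name, count in report:
    print(f"{count:>3}  {name}")
```

Keeping the per-step counts makes it easy to show stakeholders who dropped out of the population at each stage and why.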
Another factor, which applies to time series data, is the temporal overlap of different data assets. Transactional data for all customers may be available for the past 10 years, but web browsing behavior only for the last two years. In this case, it may be more effective to narrow the population to customers who were active within the past two years. With that restriction, web browsing data can be incorporated reliably into a single model, rather than maintaining two pseudo-models, one that includes web data and one that does not.
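As a rough sketch of this restriction, the snippet below computes the window in which all data assets overlap and keeps only customers who were active inside it. The coverage dates, the source names, and the `last_activity` mapping are invented for illustration.

```python
from datetime import date


def common_window(coverage):
    """Return the (start, end) period covered by every data source, or None."""
    start = max(first for first, _ in coverage.values())
    end = min(last for _, last in coverage.values())
    return (start, end) if start <= end else None


def active_in_window(last_activity, window):
    """Customer ids whose last recorded activity falls inside the window."""
    if window is None:
        return []
    start, end = window
    return sorted(cid for cid, day in last_activity.items() if start <= day <= end)


# Hypothetical coverage: ten years of transactions, two years of web data.
coverage = {
    "transactions": (date(2004, 1, 1), date(2013, 12, 31)),
    "web_browsing": (date(2012, 1, 1), date(2013, 12, 31)),
}
window = common_window(coverage)

# Hypothetical last-activity dates per customer.
last_activity = {
    "a": date(2013, 6, 1),
    "b": date(2009, 3, 15),
    "c": date(2012, 1, 1),
}
population = active_in_window(last_activity, window)
print(window, population)
```

Customer `b` has transactional history but no activity in the period covered by web data, so under this rule `b` is left out of the modeling population.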
The definition of what constitutes a customer, i.e., the level of analysis, is far from trivial. In the case of churn modeling for bank accounts, the most obvious level of analysis is the bank account. The definition is complicated by the fact that a bank account can be owned by an individual, by multiple individuals, by a corporation, or by another entity. Defining the population therefore means deciding on the granularity of the analysis. Should the population be defined at the account level or at the individual level? This decision often affects the definition of the dependent variable. Is removing one of many users from a bank account considered churn?
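The effect of granularity on the dependent variable can be seen in a small sketch. The ownership rows below are made up, and the owner-level rule (an owner churns only when all of their accounts are closed) is just one possible business rule, assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (account_id, owner_id, account_closed)
ownership = [
    ("acct-1", "alice", True),
    ("acct-2", "alice", False),
    ("acct-2", "bob", False),
    ("acct-3", "acme-corp", True),
]


def churn_by_account(rows):
    """Account-level view: an account churns when it is closed."""
    return {account: closed for account, _, closed in rows}


def churn_by_owner(rows):
    """Owner-level view (assumed rule): an owner churns only if
    every account they hold is closed."""
    closed_flags = defaultdict(list)
    for _, owner, closed in rows:
        closed_flags[owner].append(closed)
    return {owner: all(flags) for owner, flags in closed_flags.items()}


print(churn_by_account(ownership))
print(churn_by_owner(ownership))
```

At the account level, `acct-1` counts as churn. At the owner level, `alice` does not, because she still holds `acct-2`. The same data yields different labels depending on the level of analysis chosen.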
These types of questions are often only answered through close cooperation with subject matter experts and business users. For this very reason, our data science engagements involve discovery meetings, followed by multiple feedback sessions, with all stakeholders. The feedback from these sessions helps to define business rules, adjust or correct definitions, and often leads to the discovery of additional data sources that can augment the model. With the population properly defined, one can bring to bear multiple data assets and analytic tools to construct models for application and operationalization.
Check out the next blog post in the finance series, where we will look at the approaches and algorithms Pivotal data scientists have used to equip our customers with operational models and actionable insights.
About the Author
Mariann Micsinai is a member of the Data Science team at Pivotal’s New York City location. She holds a Ph.D. in Computational Biology from NYU/Yale and pursued Master’s degrees in Computational Biology, Mathematics, Economics, International Studies and Linguistics. In the bioinformatics field, Mariann focused on developing novel computational methods in human cancer genetics and on analyzing and integrating next-generation sequencing experimental data (ChIP-Seq, RNA-Seq, Exome-Seq, 4C-Seq, etc.). Prior to her experience in computational biology, she worked for Lehman Brothers’ Emerging Market Trading desk in a market risk management role. In parallel, she taught Econometrics and Mathematics for Economists at Barnard College, Columbia University. At Pivotal, Mariann is involved in solving big data problems in finance and health care analytics.