Use Automation to Streamline the Data Science QA Process

Timothy Kopp

A necessary but often tedious task that data scientists must perform as part of any project is quality assurance (QA). In a data science context, QA is the task of ensuring that the data to be analyzed and modeled is suitable for the use case at hand.

For many reasons, it is neither possible nor desirable to completely automate QA. For example, thorough QA typically involves leveraging a degree of domain expertise which is not easily encoded in a computer program. However, we can still be freed of some of the tedium of QA. This post illustrates a way to mix automation with interactivity to reduce the manual steps in QA without compromising the data scientist’s ability to do that which is best done manually.

Motivating example: Brick and mortar sales data

Some common checks need to be performed on any kind of tabular or relational data. We'll start with some example QA tasks and show how best to automate them. Then we'll see where the interactivity comes in, by way of automatically generated Jupyter notebooks.

Imagine you are doing analytics for a small chain of retail stores which sells a number of items and has a rewards program for enrolled customers. Your data might look something like the following:

import pandas as pd

transactions, customers, locations = [pd.read_csv(x) for x in ['transactions.csv', 'customers.csv', 'locations.csv']]
print('\n\n'.join([repr(x) for x in [transactions, customers, locations]]))

  transaction_id customer_id     total location_id        date
0             T1          C1      5.32          L1  2018-03-01
1             T2          C2     12.64          L2  2018-03-02
2             T3          C3  56.02.32          L1  2018-03-02
3             T4          C4      7.23          L2  2018-03-05
4             T5          C5     28.92          L1  2018-03-06

  customer_id first_name last_name         phone   address   open_date
0          C1      Alice  Ackerman  555-555-1111  1 1st St  2018-03-01
1          C2        Bob     Brown  555-555-2222  2 2nd St  2018-02-12
2          C3    Charlie  Campbell  555-555-3333  2 3rd St  2010-04-12
3          C4     Dianne   Dennett      555-4444  4 4th St  2018-05-01

  location_id     address
0          L1  123 Abc Rd
1          L3  92 Main St

Although this toy problem concerns a particular domain, we can abstract from it some general tasks that we'd want to perform on any tabular or relational dataset. For this tutorial, we'll focus on two such tasks:

  1. Verifying the uniqueness of columns we expect to be unique

  2. Verifying the join conditions between tables

A first pass at automating QA

In a software development context, we would simply write functions to perform our QA tasks, and then call them in an iterative fashion.

Suppose our datasets are small enough to fit into memory, so we've chosen Python 3/Pandas for our analysis. We can determine if a column that we expect to be the primary key for the table is in fact unique across all records by checking that the number of distinct values of that column is equal to the total number of records. To determine if two tables can be inner-joined without any misses, we perform an outer join, and see which values appear only on the left-hand or right-hand side of the join. While we could perform the join and count the records, the outer-join approach provides a place to look if we don't get the clean join we expected.

def is_unique(df, column):
	'''Returns true if df.column is fully populated with unique values'''
	# nunique() excludes NaN, so missing values also fail the check
	return len(df) == df[column].nunique()

def join_properties(left_df, left_name, left_key, right_df, right_name, right_key):
	'''Returns a dictionary with keys 'both', '<left_name>_only', and
	'<right_name>_only' mapping to the record counts of an outer join'''
	join = pd.merge(left_df, right_df, how='outer', left_on=left_key, right_on=right_key, indicator=True)
	counts = join.groupby('_merge').agg('count')[left_key].to_dict()
	counts['{}_only'.format(left_name)] = counts.pop('left_only')
	counts['{}_only'.format(right_name)] = counts.pop('right_only')
	return counts

Now that we have methods for testing the properties in which we are interested, we can automate the QA of our toy data set using a dictionary comprehension. The comprehension iterates over a list of (table, column) pair literals. For each pair, we pass the DataFrame (accessed with Python's built-in `globals()` dictionary by its variable name) and the column to test to our `is_unique()` function, and return the result as the value keyed by the table and column. The result is a dictionary that we can quickly scan for non-unique dataframe-column pairs.

uniques = {
	'{}.{}'.format(x[0], x[1]): is_unique(globals()[x[0]], x[1])
	for x in [
    	('transactions', 'transaction_id'),
    	('customers', 'customer_id'),
    	('locations', 'location_id'),
    	('transactions', 'location_id'), # we expect this _not_ to be unique
	]
}
uniques

	{'customers.customer_id': True,
 	'locations.location_id': True,
 	'transactions.location_id': False,
 	'transactions.transaction_id': True}

Similarly, for the join QA, we iterate over a list of tuples in a dictionary comprehension and scan the output.

joins = {
	'{}.{} x {}.{}'.format(*x): join_properties(globals()[x[0]], x[0], x[1], globals()[x[2]], x[2], x[3])
	for x in [
    	('customers', 'customer_id', 'transactions', 'customer_id'),
    	('locations', 'location_id', 'transactions', 'location_id'),
	]
}
joins

	{'customers.customer_id x transactions.customer_id': {'both': 4,
 	 'customers_only': 0,
 	 'transactions_only': 1},
 	'locations.location_id x transactions.location_id': {'both': 3,
 	 'locations_only': 1,
 	 'transactions_only': 2}}

For small datasets, this works just fine. We avoid much tedious work by abstracting the counts and joins away from what is interesting: the uniqueness and join characteristics of the dataset. If we find problems or anomalies in the data, we can make the appropriate changes and rerun the script to verify the issues are resolved.

However, this approach has several shortcomings.

  • Expensive: For a large dataset, each of these individual queries may take minutes or even hours. We certainly want to run the entire battery of tests once to get initial results. But after the initial run, we don't want to re-run every test just to verify that, for example, a single column of a single table was regularized to have unique values.

  • Brittle: This approach doesn't elegantly allow for slight differences in the way different tables are processed. Suppose we identify a problem in the source data: one of the store locations doesn't correctly add its customers to the rewards program, so those customers appear in the transactions table but not in the customers table. Having identified this issue, we might want to modify the join condition for these two tables, but no others. We'd need to develop a special convention for this edge case (one possible convention is sketched after this list).

  • Non-interactive: Suppose we find an issue we weren't expecting. We will want to dig into it further, and we'd probably also like the analysis that motivated the inquiry to appear nearby in our code and results. This is difficult with the current approach.
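To make the brittleness point concrete, here is one possible convention: move the test specifications into a richer data structure, where each entry can carry per-pair overrides. This is a minimal sketch, assuming the `join_properties()` function and the loaded DataFrames above are in scope; the `allow` field and the spec layout are illustrative choices, not part of the workflow so far.

join_specs = [
	{'left': 'customers', 'left_key': 'customer_id',
 	 'right': 'transactions', 'right_key': 'customer_id',
 	 # Known issue: one store doesn't enroll customers, so transaction-only
 	 # records are expected here until the source data is fixed
 	 'allow': {'transactions_only'}},
	{'left': 'locations', 'left_key': 'location_id',
 	 'right': 'transactions', 'right_key': 'location_id',
 	 'allow': set()},
]

for spec in join_specs:
	counts = join_properties(
    	globals()[spec['left']], spec['left'], spec['left_key'],
    	globals()[spec['right']], spec['right'], spec['right_key'])
	# Flag any non-'both' bucket that is nonzero and not explicitly allowed
	problems = {k: v for k, v in counts.items()
            	if k != 'both' and v > 0 and k not in spec['allow']}
	print('{} x {}: {}'.format(spec['left'], spec['right'], problems or 'OK'))

Each edge case is now a data change rather than a code change, but the specs grow ever more elaborate, and we still haven't addressed cost or interactivity.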

A second pass: Automating Jupyter notebook generation

Jupyter notebooks, which allow individual code cells to be run interactively and out of order, are well-suited to this task. If we kept our `is_unique()` and `join_properties()` functions and manually populated a Jupyter cell for each call to one of these functions, we would already save a moderate amount of work. Furthermore, as we need to re-run different queries, modify the conditions for different tables, and find and explore unexpected issues, we can readily augment the notebook with more analysis while avoiding the pitfalls listed above.

However, this is still tedious, and our dense and easy-to-read list specifications are lost. We can do better.

Enter: `nbformat`

The `nbformat` package provides the means to create and manipulate Jupyter notebooks programmatically. If you've used the command-line tool `jupyter nbconvert` to convert your Jupyter notebooks to scripts, PDFs, or the like, then you've already used `nbformat` indirectly: the `nbconvert` package uses it under the hood.

For our purposes, we want to focus on two key functions: `new_markdown_cell()` and `new_code_cell()`. We will recreate the exercise above, but this time, instead of computing the results directly, we will create notebook cells that perform those computations.

import nbformat
import nbformat.v4 as nbf

tables = ['locations', 'transactions', 'customers']
cells = [
	nbf.new_markdown_cell('# Procedurally Generated QA Notebook'),
	nbf.new_markdown_cell('## Load tables'),
	nbf.new_code_cell(
    	'import pandas as pd\n' +
    	'\n'.join([
        	'{0} = pd.read_csv(\'{0}.csv\')'.format(x)
        	for x in tables
    	])
	),
	nbf.new_markdown_cell('## QA Functions'),
	nbf.new_code_cell(
'''def is_unique(df, column):
	\'''Returns true if df.column is fully populated with unique values\'''
	# nunique() excludes NaN, so missing values also fail the check
	return len(df) == df[column].nunique()

def join_properties(left_df, left_name, left_key, right_df, right_name, right_key):
	\'''Returns a dictionary with keys 'both', '<left_name>_only', and
	'<right_name>_only' mapping to the record counts of an outer join\'''
	join = pd.merge(left_df, right_df, how='outer', left_on=left_key, right_on=right_key, indicator=True)
	counts = join.groupby('_merge').agg('count')[left_key].to_dict()
	counts['{}_only'.format(left_name)] = counts.pop('left_only')
	counts['{}_only'.format(right_name)] = counts.pop('right_only')
	return counts
'''
	),
]
cells.append(nbf.new_markdown_cell('## Uniqueness Tests'))
cells += [
	nbf.new_code_cell('is_unique({}, \'{}\')'.format(x[0], x[1]))
	for x in [
    	('transactions', 'transaction_id'),
    	('customers', 'customer_id'),
    	('locations', 'location_id'),
    	('transactions', 'location_id'),
	]
]
cells.append(nbf.new_markdown_cell('## Join Tests'))
cells += [
	nbf.new_code_cell('join_properties({0}, \'{0}\', \'{1}\', {2}, \'{2}\', \'{3}\')'.format(*x))
	for x in [
    	('customers', 'customer_id', 'transactions', 'customer_id'),
    	('locations', 'location_id', 'transactions', 'location_id'),
	]
]

nb = nbf.new_notebook()
nb['cells'] = cells
with open('generated_notebook.ipynb', 'w') as f:
	nbformat.write(nb, f)

Running this script generates the file `generated_notebook.ipynb`. The whole notebook can be run to perform the entire QA suite, or individual cells can be run manually to avoid re-running expensive computations. If the data model changes over time, it's trivial to change the above script to add new tables and tests, regenerate the notebook, and run the new tests.
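The generated notebook can also be executed headlessly, which is handy for scheduled re-runs of the full suite. Here is a minimal sketch using `nbconvert`'s `ExecutePreprocessor`; the timeout and kernel name are assumptions you'd adjust for your environment:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Read the generated notebook, execute all cells in order, and save a
# copy with the outputs populated
with open('generated_notebook.ipynb') as f:
	nb = nbformat.read(f, as_version=4)

ep = ExecutePreprocessor(timeout=600, kernel_name='python3')  # assumed settings
ep.preprocess(nb, {'metadata': {'path': '.'}})  # run in the directory with the CSVs

with open('generated_notebook_executed.ipynb', 'w') as f:
	nbformat.write(nb, f)

Equivalently, `jupyter nbconvert --to notebook --execute generated_notebook.ipynb` does the same from the command line.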

More possibilities

The fun doesn't stop there. There are a number of additional ways to improve our notebook. Here are a few potential extensions:

  • Enrich the lists from which the cells are generated to include comments about what we expect the results of the tests to be, and then use that text to generate markdown cells annotating each test where it is run.

  • Enrich the lists in the manner above, and generate code cells that test those conditions programmatically (a sketch of these first two ideas follows this list).

  • Generate cells that produce relevant plots. For example, the results of the join condition QA could be nicely presented as a Venn diagram. The duplicated values in columns expected to be keys could be presented as a bar chart of the most frequently occurring values.

  • Generate the notebook in such a way that running `jupyter nbconvert --to slides` generates a slideshow to present to clients, stakeholders, data engineers, or whomever else is interested in the results.
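As a sketch of the first two ideas, the generation script above could be extended as follows. The `expect` and `note` fields, and the choice of a bare `assert` for the check, are illustrative rather than the only option:

# Enriched uniqueness specs carrying an expectation and a human-readable note;
# these extend the `cells` list built in the generation script above
uniqueness_specs = [
	('transactions', 'transaction_id', True, 'Primary key; duplicates mean double-counted sales.'),
	('transactions', 'location_id', False, 'Each store should have many transactions.'),
]

for table, column, expect, note in uniqueness_specs:
	# Annotate each test with a markdown cell stating what we expect and why
	cells.append(nbf.new_markdown_cell(
    	'### `{}.{}`\n{} Expected unique: `{}`.'.format(table, column, note, expect)))
	# Generate a code cell that checks the condition programmatically
	cells.append(nbf.new_code_cell(
    	'assert is_unique({}, \'{}\') == {}'.format(table, column, expect)))

Running all cells then fails loudly at exactly the test that regressed, rather than requiring a scan of a printed dictionary.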

As we said at the beginning, QA can be tedious, but it is hugely important to the success of any data science venture. By leveraging the `nbformat` library, we can automate away some of the tedium of writing many near-identical queries and work at a more comfortable level of abstraction, focusing on the data set itself. If our data set is sufficiently large or complex, we can even save a bit of time.

Do you have any ideas about how to take advantage of automation to improve QA or any other part of the analytics process? Let us know!

About the Author

Timothy Kopp

Tim Kopp is a senior data scientist at Pivotal, where he works with customers to build and deploy machine learning models to leverage their data. He holds a PhD in computer science from the University of Rochester. As a researcher, Tim developed algorithms for inference in statistical-relational machine learning models.
