How Data Scientists Can Tame Jupyter Notebooks for Use in Production Systems

July 12, 2018 Timothy Kopp

Uncounted pixels have been spilled about how great Jupyter Notebooks are (shameless plug: I've spilled some of those pixels myself). Jupyter Notebooks allow data scientists to quickly iterate as we explore data sets, try different models, visualize trends, and perform many other tasks. We can execute code out-of-order, preserving context as we tweak our programs. We can even convert our notebooks into documents or slides to present to our stakeholders.

Jupyter Notebooks help us work through a project from its earliest stages to a point where we can say a great deal. "Yes, we now know which demographics are most responsive to your advertisements." "Yes, we can build a model and expect it to give you useful predictions." But what happens when we want to say, "Here is an artifact that will generate these predictions when I am gone"? Or, "Here is a model that you can integrate with your other analytics systems"? Because of their interactive nature, Jupyter Notebooks require a person to drive them. While Jupyter has built-in facilities to convert a notebook to an executable script, this is rarely sufficient in practice.

In this post I'll present a tool I’ve created that allows one to use Jupyter Notebooks to create and modify production-ready code for data science applications.

Command-line Arguments: A Motivating Example

A common task when productionizing code originally developed in a notebook is integrating it with the environment in which it will run. Often we want our program to be executable from the command line so that it can be run by tools like cron and Concourse, which almost always involves accepting and parsing command-line arguments and reporting errors in them. Most languages have built-in utilities for doing this, such as Python's argparse, but one usually doesn't write a Jupyter Notebook expecting it to accept command-line arguments.
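As a rough illustration of the boilerplate in question, here is a minimal argparse sketch. The argument names (`input_path`, `--verbose`) and the `build_parser` and `main` functions are hypothetical, chosen only for this example.

```python
import argparse


def build_parser():
    """Describe the command-line interface (argument names are illustrative)."""
    parser = argparse.ArgumentParser(description="Score a data set with a trained model.")
    parser.add_argument("input_path", help="path to the data file to score")
    parser.add_argument("--verbose", action="store_true", help="print progress messages")
    return parser


def main(argv=None):
    # With argv=None, argparse reads sys.argv[1:], which is what cron would supply.
    # Invalid or missing arguments make argparse print an error and exit.
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Scoring {args.input_path}")
    return args


# Passing an explicit list mimics: python score.py data.csv --verbose
args = main(["data.csv", "--verbose"])
```

None of this is useful inside a notebook, where there is no command line to parse, which is why it tends to end up in a separate script.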

A solution to this problem is to maintain two versions of our code. We convert our notebook to a script using the built-in utility, and add in the command-line boilerplate. When we change the code in the notebook, we copy those changes over to our script version, a process prone to human error and forgetfulness. If we're tracking changes in the script version with a version control system like git (something that's messy to do with a notebook, given Jupyter's JSON file format), we have to manually inspect the different commits to know the version to which our notebook corresponds. If we want to match our notebook up to a different version, we have to manually copy the changes over.

Manually maintaining a command-line version and a notebook version of a codebase isn't the worst thing in the world. But what happens when the difference between notebook and production isn't being command-line executable, but instead is interfacing with a database or servicing an API? Maintaining separate slightly-different "production" and "development/notebook" versions of our program quickly becomes a nightmare.

Solution: Automate the Conversion with nbformat

It would be great if we could maintain both versions of the code, both being necessary for our workflow, in the same file. Ideally, we could execute the code both as a notebook and as a standalone script, sharing common code and documentation but selectively behaving differently depending on how it is run. This was my goal when writing a pair of scripts I've named notebook-tools, which use the nbformat library. While these tools are specifically for Python notebooks, the idea is easily applied to other programming languages.

We define a special additional syntax for the Python language, which is valid Python, but which our tool can parse in order to convert between Python scripts and Jupyter Notebooks. The provided tools can convert a Jupyter Notebook of Python code into this syntax, and vice-versa. Since the syntax is just Python, it can be executed in a standard Python interpreter. This syntax consists of four elements:

  • markdown cell

  • general code cell

  • Jupyter code cell

  • script code cell

Markdown cell

A markdown cell is denoted by a multiline Python string. A string literal that stands alone as a statement, neither assigned to a name nor passed to a call, is evaluated and discarded by the Python interpreter, so encoding markdown cells in this way does not change what the program does when it is executed.

"""
# This is a Markdown cell
It can encode \LaTeX and everything!
"""
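To see why a bare string is harmless, the sketch below parses and runs a small program with Python's standard `ast` module. The sample source and variable names are made up for this illustration; it is not part of notebook-tools.

```python
import ast

# A tiny program with a bare multiline string between two statements.
source = 'x = 1\n"""\n# This is a Markdown cell\n"""\ny = x + 1\n'

tree = ast.parse(source)
middle = tree.body[1]

# A bare string literal parses as an expression statement wrapping a constant.
is_bare_string = (
    isinstance(middle, ast.Expr)
    and isinstance(middle.value, ast.Constant)
    and isinstance(middle.value.value, str)
)

namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)

# Only x and y were bound; the string in the middle had no effect.
defined_names = [name for name in namespace if not name.startswith("__")]
```

One caveat: when the string is the very first statement in a file, Python stores it as the module docstring, which is equally harmless.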

General Code Cell

A general code cell is denoted by "#>". The lines following it are treated as Python to be executed no matter the context.

#>
print("This code will be executed in Jupyter and when run as a script")
    
#>
print("So will this, but in a notebook, it will be in its own cell")

Jupyter Code Cell

This is the tricky one. The start of a Jupyter code cell is "#nb>" ("nb" stands for "notebook"). Every line following it that is intended to be in the same cell should start with a "#", i.e. a Python comment character. Commenting the lines out keeps notebook-only code from running when the file is executed as a script.

#nb>
#print("I'll only be executed when converted to a Jupyter notebook")

Script Code Cell

Script code cells are the complement of Jupyter code cells. They are executed when run as a script but not in a notebook. We denote a script code cell with "#py>". The tool comments out all of the code in the script code cell when converted to a notebook. This way, the cell can be viewed and even executed, but none of the effects of the code’s execution take place. This is important if you're accustomed to mindlessly executing cells in a row until you reach the one in which you are interested.

#py>
print("I will execute when run as a script, but my notebook cell will be commented out")

Using These Tools in a Production Workflow

These tools make developing a data science application that runs in production much easier. We can seamlessly switch between the notebook and script formats. One moment we're debugging in Jupyter, the next we're submitting a dozen long-running jobs from the command line.

All of this is enabled with two scripts:

# Convert notebook to executable Python script
$ to-script my-cool-notebook.ipynb my-production-script.py   

# Convert a script enriched with the specified format
# to a notebook
$ to-notebook my-production-script.py my-notebook-for-debugging.ipynb
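For readers curious how the script-to-notebook direction could work, here is a simplified, standard-library-only sketch that splits an annotated script into cells. It is not the actual notebook-tools implementation: the function names are hypothetical, and it assumes each marker and each triple-quote delimiter sits on its own line.

```python
def split_cells(script_text):
    """Split annotated script text into (kind, source) pairs.

    kind is one of "markdown", "code", "nb", or "py".
    """
    markers = {"#>": "code", "#nb>": "nb", "#py>": "py"}
    cells = []
    kind, lines = None, []

    def flush():
        # Store the cell collected so far, skipping empty ones.
        source = "\n".join(lines).strip("\n")
        if kind is not None and source.strip():
            cells.append((kind, source))

    for line in script_text.splitlines():
        stripped = line.strip()
        if kind == "markdown":
            if stripped == '"""':  # closing delimiter ends the markdown cell
                flush()
                kind, lines = None, []
            else:
                lines.append(line)
        elif stripped == '"""':  # opening delimiter starts a markdown cell
            flush()
            kind, lines = "markdown", []
        elif stripped in markers:
            flush()
            kind, lines = markers[stripped], []
        elif kind == "nb":
            # Notebook-only lines are stored commented out; drop one leading "#".
            lines.append(line[1:] if line.startswith("#") else line)
        elif kind is not None:
            lines.append(line)
    flush()
    return cells


def notebook_source(kind, source):
    """Return the text a notebook cell would hold for this cell."""
    if kind == "py":
        # Script-only code stays visible in the notebook but is commented out.
        return "\n".join("#" + line for line in source.splitlines())
    return source


example = '''"""
# Title
"""

#>
x = 1

#nb>
#print("notebook only")

#py>
print("script only")
'''
cells = split_cells(example)
```

From here, each pair could be turned into a real notebook cell with nbformat's `nbformat.v4.new_markdown_cell` and `nbformat.v4.new_code_cell` and written out with `nbformat.write`.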

A motivated data scientist who buys into this workflow completely could even automate the conversion with git hooks, so that it runs each time a particular git command is run.
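One way such a hook could look is sketched below as a pre-commit script that regenerates the Python script for every staged notebook. It assumes `to-script` is on the PATH and takes the notebook path followed by the script path, as in the commands above; the function names and the choice to write the script next to the notebook are my own assumptions.

```python
import subprocess
from pathlib import PurePosixPath


def staged_notebooks():
    """List staged .ipynb files (added, copied, or modified)."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [name for name in result.stdout.splitlines() if name.endswith(".ipynb")]


def conversion_command(notebook_path):
    """Build the to-script invocation for one notebook."""
    # git reports paths with forward slashes, so PurePosixPath keeps them intact.
    script_path = str(PurePosixPath(notebook_path).with_suffix(".py"))
    return ["to-script", notebook_path, script_path]


def main():
    for notebook in staged_notebooks():
        command = conversion_command(notebook)
        subprocess.run(command, check=True)
        # Stage the regenerated script so it lands in the same commit.
        subprocess.run(["git", "add", command[2]], check=True)
    return 0
```

To use it, one would save the file as an executable `.git/hooks/pre-commit` with a Python shebang line and end it with `raise SystemExit(main())`.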

Jupyter Notebooks are a boon to data scientists, helping us quickly get from the exploratory stages of a project to a proof-of-concept. By leveraging the nbformat library, we can continue to use this tool effectively as we transition our project into a production data science application. Even better, we can develop our applications with a mind for production right from the start.

About the Author

Timothy Kopp

Tim Kopp is a senior data scientist at Pivotal, where he works with customers to build and deploy machine learning models to leverage their data. He holds a PhD in computer science from the University of Rochester. As a researcher, Tim developed algorithms for inference in statistical-relational machine learning models.
