How Data Scientists Can Tame Jupyter Notebooks for Use in Production Systems

July 12, 2018 Timothy Kopp

Uncounted pixels have been spilled about how great Jupyter Notebooks are (shameless plug: I've spilled some of those pixels myself). Jupyter Notebooks allow data scientists to quickly iterate as we explore data sets, try different models, visualize trends, and perform many other tasks. We can execute code out-of-order, preserving context as we tweak our programs. We can even convert our notebooks into documents or slides to present to our stakeholders.

Jupyter Notebooks help us work through a project from its earliest stages to a point where we can say a great deal. "Yes, we now know which demographics are most responsive to your advertisements." "Yes, we can build a model and expect it to give you useful predictions." But what happens when we want to say, "Here is an artifact that will generate these predictions when I am gone"? Or, "Here is a model that you can integrate with your other analytics systems"? Because of their interactive nature, Jupyter Notebooks require a person to drive them. While Jupyter has built-in facilities to convert a notebook to an executable script, this is rarely sufficient in practice.

In this post I'll present a tool I’ve created that allows one to use Jupyter Notebooks to create and modify production-ready code for data science applications.

Command-line Arguments: A Motivating Example

A common task when productionalizing code originally developed in a notebook is integrating it with the environment in which it will run. Often we want our program to be executed from the command line so that it can be run by tools like cron and Concourse, which almost always involves accepting and parsing command-line arguments and reporting errors in them. Most languages have built-in utilities for doing this, such as Python's argparse, but one usually doesn't write a Jupyter Notebook expecting it to accept command-line arguments.
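As an illustration of the boilerplate in question, here is a minimal argparse sketch. The function name parse_args and the options --input and --n-estimators are hypothetical, chosen only for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments of a hypothetical model-training script."""
    parser = argparse.ArgumentParser(description="Train a model on a CSV file.")
    # Both options below are made up for illustration.
    parser.add_argument("--input", required=True,
                        help="path to the training data")
    parser.add_argument("--n-estimators", type=int, default=100,
                        help="number of trees to fit")
    # argv=None makes argparse read sys.argv[1:], as a real script would.
    return parser.parse_args(argv)


# Passing an explicit list stands in for a real command line here.
args = parse_args(["--input", "data.csv", "--n-estimators", "50"])
print("Training on", args.input, "with", args.n_estimators, "trees")
```

None of this has a natural home in a notebook, where the same values would typically be typed directly into a cell.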

A solution to this problem is to maintain two versions of our code. We convert our notebook to a script using the built-in utility, and add in the command-line boilerplate. When we change the code in the notebook, we copy those changes over to our script version, a process prone to human error and forgetfulness. If we're tracking changes in the script version with a version control system like git (something that's messy to do with a notebook, given Jupyter's JSON file format), we have to manually inspect the different commits to know the version to which our notebook corresponds. If we want to match our notebook up to a different version, we have to manually copy the changes over.

Manually maintaining a command-line version and a notebook version of a codebase isn't the worst thing in the world. But what happens when the difference between notebook and production isn't command-line execution, but interfacing with a database or serving an API? Maintaining separate, slightly different "production" and "development/notebook" versions of our program quickly becomes a nightmare.

Solution: Automate the Conversion with nbconvert

It would be great if we could maintain both versions of the code, each of which is necessary for our workflow, in the same file. Ideally, we could execute the code both as a notebook and as a standalone script, sharing common code and documentation but selectively behaving differently depending on how it was run. This was my goal when writing a pair of scripts I've named notebook-tools, which use the nbformat library. While these tools are specifically for Python notebooks, the idea is easily applied to other programming languages.

We define a special additional syntax for the Python language, which is valid Python, but which our tool can parse in order to convert between Python scripts and Jupyter Notebooks. The provided tools can convert a Jupyter Notebook of Python code into this syntax, and vice-versa. Since the syntax is just Python, it can be executed in a standard Python interpreter. This syntax consists of four elements:

markdown cell
general code cell
Jupyter code cell
script code cell

Markdown cell

A markdown cell is denoted by a multiline Python string. A string literal that stands alone as a statement, neither assigned to a name nor used in a call, is evaluated and discarded by the Python interpreter, so encoding markdown cells in this way has no effect on the behavior of the program.

"""
# This is a Markdown cell
It can encode \LaTeX and everything!
"""

General Code Cell

A normal code cell is denoted by "#>". The lines following it are treated as Python to be executed no matter the context.

#>
print("This code will be executed in Jupyter and when run as a script")

#>
print("So will this, but in a notebook, it will be in its own cell")

Jupyter Code Cell

This is the tricky one. The start of a Jupyter code cell is "#nb>" ("nb" stands for "notebook"). Every subsequent line that belongs to the same cell should start with a "#", i.e. the Python comment character. This is because notebook-only cells should not be run when the file is executed as a script.

#nb>
#print("I'll only be executed when converted to a Jupyter notebook")

Script Code Cell

Script code cells are the complement of Jupyter code cells. They are executed when run as a script but not in a notebook. We denote a script code cell with "#py>". When converting to a notebook, the tool comments out all of the code in the script code cell. This way, the cell can be viewed and even executed, but none of the effects of the code’s execution take place. This is important if you're accustomed to mindlessly executing cells in a row until you reach the one in which you are interested.

#py>
print("I will execute when run as a script, but my notebook cell will be commented out")
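Putting the four elements together, a file in this syntax might look like the sketch below. The variable input_path and the file names are hypothetical, and sys.argv is used instead of argparse only to keep the example short. The idea is that the notebook-only cell hard-codes a value, the script-only cell reads it from the command line, and the shared cell uses whichever one was set.

```python
"""
# Load the data
This text becomes a markdown cell in the notebook.
"""

#>
import sys

#nb>
#input_path = "sample.csv"

#py>
input_path = sys.argv[1] if len(sys.argv) > 1 else "data.csv"

#>
print("Reading from", input_path)
```

Because every marker is an ordinary Python comment, the file runs unchanged in a standard interpreter, where the "#nb>" cell stays commented out.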

Using These Tools in a Production Workflow

These tools make developing a data science application that runs in production much easier. We can seamlessly switch between the notebook and script formats. One moment we're debugging in Jupyter, the next we're submitting a dozen long-running jobs from the command line.

All of this is enabled with two scripts:

# Convert a notebook to an executable Python script
$ to-script my-cool-notebook.ipynb my-production-script.py

# Convert a script written in the syntax described above
# to a notebook
$ to-notebook my-production-script.py my-notebook-for-debugging.ipynb
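To make the script-to-notebook direction concrete, here is a rough sketch of how a file in this syntax could be split into notebook cells. It is not the notebook-tools implementation: the function split_cells, its treatment of blank lines, and its choice to ignore text before the first marker are all assumptions made for illustration.

```python
MARKERS = {"#>": "general", "#nb>": "notebook", "#py>": "script"}


def split_cells(script):
    """Split marker-syntax source into (cell_type, source) pairs."""
    cells = []
    kind, lines = None, []  # the cell currently being collected

    def flush():
        # Store the current cell, dropping blank lines at either end.
        text = "\n".join(lines).strip("\n")
        if kind is not None and text:
            cells.append(("markdown" if kind == "markdown" else "code", text))

    for line in script.splitlines():
        stripped = line.strip()
        if kind == "markdown":
            if stripped == '"""':  # closing quotes end the markdown cell
                flush()
                kind, lines = None, []
            else:
                lines.append(line)
        elif stripped == '"""':  # opening quotes start a markdown cell
            flush()
            kind, lines = "markdown", []
        elif stripped in MARKERS:
            flush()
            kind, lines = MARKERS[stripped], []
        elif kind == "notebook":  # uncomment notebook-only code
            lines.append(line[1:] if line.startswith("#") else line)
        elif kind == "script":  # comment out script-only code
            lines.append("#" + line if stripped else line)
        elif kind == "general":
            lines.append(line)
        # Anything before the first marker is ignored in this sketch.
    flush()
    return cells


example = '''"""
# Title
"""

#>
x = 1

#nb>
#print(x)

#py>
print(x + 1)
'''
cells = split_cells(example)
for cell_type, source in cells:
    print(cell_type, repr(source))
```

The resulting pairs could then be passed to the nbformat.v4 helpers new_markdown_cell, new_code_cell, and new_notebook, and saved with nbformat.write.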

A motivated data scientist who buys into this workflow completely could even use git hooks to run the conversion automatically each time a particular git command is executed.
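One way such a hook might look is sketched below as a Python pre-commit script. It assumes that to-script is on the PATH and takes its arguments in the order shown above, and that each script should be named after its notebook; the function names conversion_commands and run_hook are hypothetical.

```python
import subprocess
from pathlib import Path


def conversion_commands(paths):
    """Build one to-script command per notebook found in paths."""
    commands = []
    for path in paths:
        if path.endswith(".ipynb"):
            # Assumption: the script sits next to the notebook, same stem.
            script = Path(path).with_suffix(".py").as_posix()
            commands.append(["to-script", path, script])
    return commands


def run_hook():
    """Convert every staged notebook and stage the resulting script."""
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for command in conversion_commands(staged):
        subprocess.run(command, check=True)  # a failure aborts the commit
        subprocess.run(["git", "add", command[-1]], check=True)
    return 0
```

To use it, the file would be saved as an executable .git/hooks/pre-commit with a Python shebang line and a call to run_hook() at the bottom.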

Jupyter Notebooks are a boon to data scientists, helping us quickly get from the exploratory stages of a project to a proof-of-concept. By leveraging the nbformat library, we can continue to use this tool effectively as we transition our project into a production data science application. Even better, we can develop our applications with a mind for production right from the start.

About the Author

Tim Kopp is a senior data scientist at Pivotal, where he works with customers to build and deploy machine learning models to leverage their data. He holds a PhD in computer science from the University of Rochester. As a researcher, Tim developed algorithms for inference in statistical-relational machine learning models.
