sql_magic: Jupyter Magic to Write SQL for Apache Spark and Relational Databases

July 13, 2017 Chris Rawles

Data scientists love Jupyter Notebook, Python, and Pandas. And they also write SQL. I created sql_magic to facilitate writing SQL code from Jupyter Notebook for use with both Apache Spark (or Hive) and relational databases such as PostgreSQL, MySQL, Pivotal Greenplum and HDB, and others. The library supports SQLAlchemy connection objects, psycopg connection objects, SparkSession and SQLContext objects, and other connection types. The %%read_sql magic function returns results as a Pandas DataFrame for analysis and visualization.

%%read_sql df_result
SELECT {col_names}
FROM {table_name}
WHERE age < 10
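The {col_names} and {table_name} placeholders are Python variables from the notebook namespace, substituted into the query before it runs, and the result is bound to df_result as a Pandas DataFrame. A rough, self-contained sketch of that substitute-and-fetch pattern using plain pandas and sqlite3 (an illustration of the idea, not sql_magic's internals):

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory SQLite table to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 8), ("Bob", 34), ("Cal", 6)])

# Notebook variables substituted into the query, as the {var} syntax does.
col_names = "name, age"
table_name = "people"

query = """SELECT {col_names}
FROM {table_name}
WHERE age < 10""".format(col_names=col_names, table_name=table_name)

# The magic's end result: a Pandas DataFrame bound to the given name.
df_result = pd.read_sql(query, conn)
print(df_result)
```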

The sql_magic library builds on existing libraries such as ipython-sql, adding the following features:

  • Support for both Apache Spark and relational database connections simultaneously
  • Asynchronous execution (useful for long queries)
  • Browser notifications for query completion
# Installation
pip install sql_magic
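After installing, the workflow in a notebook is to create a connection object and then point the extension at it by name. A minimal setup sketch, assuming a SQLAlchemy engine and the SQL.conn_name config key (check the project README for the exact key and supported options):

```python
# In a notebook cell: create a connection object first.
from sqlalchemy import create_engine
conn = create_engine('postgresql://user:pass@localhost:5432/mydb')

# Then load the extension and tell it which connection to use.
%load_ext sql_magic
%config SQL.conn_name = 'conn'
```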

Check out the GitHub repository for more information.


About the Author

Chris Rawles

Chris Rawles is a senior data scientist at Pivotal in New York, New York, where he works with customers across a variety of domains, building models to derive insight and business value from their data. He holds an MS and BA in geophysics from UW-Madison and UC Berkeley, respectively. During his time as a researcher, Chris focused his efforts on using machine learning to enable research in seismology.
