Plotting Using an MPP Database

March 14, 2017 Greg Tam

Data visualization is the process of transforming and condensing data into an easily digestible graphic. It is crucial in helping data scientists understand their data and share their insights with others. With the recent surge of big data, data scientists must adapt their visualization techniques, since traditional plotting methods are limited by the machine's memory and thus only work on smaller, local data. This blog introduces methods for generating histograms and scatter plots, two of the most common ways to visualize univariate and bivariate data, in an MPP database such as Pivotal Greenplum (GPDB) or Apache HAWQ (incubating), or in PostgreSQL. Additionally, we will show how to compute an ROC curve, a plot that measures the performance of a binary classifier, in-database.

MPP Histogram

Histograms are visual representations of the distribution of univariate data. They are created by grouping data into bins and plotting the number of observations that fall into each bin. If the amount of data we wish to plot is too large, we cannot use the standard histogram functions that come with Python or R. However, because a histogram is a summary of the data, consisting only of bin locations and heights, we can perform the legwork of computing these in-database using the parallel capabilities of GPDB or HAWQ. The desired output table, which can then be plotted in Python or R, will be much smaller than the original data; it will only contain columns indicating the bin locations and heights.

We can map the data to their bins in three distinct steps:

  1. Scale the data to range from 0 to the desired number of bins.
  2. Take the floor of the scaled values to discretize them into distinct groups.
  3. Map the binned values back to the original scale.

One caveat: a data point equal to the variable's maximum scales exactly to nbins, so the floor step would place it in a bin by itself. We account for this by placing it in the second-to-last bin.

This set of steps can be summarized in a single formula:

bin_loc = FLOOR((x - min_val) / (max_val - min_val) * nbins) / nbins * (max_val - min_val) + min_val

where x is the value being binned and min_val and max_val are the variable's minimum and maximum. For example, with min_val = 0, max_val = 10, and nbins = 5, the value x = 7.3 maps to FLOOR(3.65)/5 * 10 = 6, the left edge of its bin.
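As a quick sanity check, here is a toy query (standard PostgreSQL; the values are generated with generate_series purely for illustration) that applies the formula with min_val = 0, max_val = 10, and nbins = 5:

WITH t AS
     (SELECT generate_series(0, 10)::NUMERIC AS x),
     mm AS
     (SELECT MIN(x) AS min_val, MAX(x) AS max_val FROM t)
SELECT x,
       CASE WHEN x < max_val
                 THEN FLOOR((x - min_val)/(max_val - min_val) * 5)
                      /5 * (max_val - min_val) + min_val
            ELSE (5 - 1)::NUMERIC/5 * (max_val - min_val) + min_val
        END AS bin_loc
  FROM t
       CROSS JOIN mm
 ORDER BY x;

Each value lands on the left edge of its bin (0, 2, 4, 6, or 8), and the maximum value 10 is folded into the bin at 8 rather than forming a bin of its own.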

Now that we have mapped each value to a bin location, we can group by bin location and count the rows in each group to get the bin heights. Template code to illustrate this process is shown below (note that the DISTRIBUTED clause does not exist in PostgreSQL, so if you are using PostgreSQL, omit the final line):

CREATE TABLE histogram_values
     AS WITH min_max_table AS
             (SELECT MIN(column_name) AS min_val,
                     MAX(column_name) AS max_val
                FROM table_name
             ),
             binned_table AS
             (SELECT CASE WHEN column_name < max_val
                               THEN FLOOR((column_name - min_val)::NUMERIC
                                          /(max_val - min_val)
                                          * nbins
                                         )
                                    /nbins * (max_val - min_val) 
                                    + min_val
                          WHEN column_name = max_val
                               THEN (nbins - 1)::NUMERIC/nbins * (max_val - min_val) 
                                    + min_val
                          ELSE NULL
                           END AS bin_loc
                FROM table_name
                     CROSS JOIN min_max_table
               WHERE column_name IS NOT NULL
             )
      SELECT bin_loc, COUNT(*) AS bin_height
        FROM binned_table
       GROUP BY bin_loc
       ORDER BY bin_loc
 DISTRIBUTED BY (bin_loc);

We replace column_name and table_name with their appropriate names and nbins with an integer value (for example, a hypothetical trips table with a numeric trip_distance column and nbins = 30). The resulting table has two columns: the bin location and its frequency, i.e., the number of observations that fall into that bin. All bins have equal width and are evenly spaced.

Plotting MPP Histograms in Python

Using this information, we can then plot bar charts to create our histogram. In Python, we can pull the histogram_values table locally via the psycopg2 library. Assuming conn is a psycopg2 connection object pointing to GPDB or HAWQ, and psql is the alias for the pandas.io.sql module, we can run the following code:

import pandas.io.sql as psql

# conn is an open psycopg2 connection to GPDB or HAWQ
sql = '''
SELECT *
  FROM histogram_values
 ORDER BY bin_loc;
'''
py_hist_df = psql.read_sql(sql, conn)

This data can then be used to plot a histogram using matplotlib where plt is the alias for the matplotlib.pyplot module.

import matplotlib.pyplot as plt

# All bins are equally wide, so the mean gap between
# consecutive bin locations gives the bin width
bin_width = py_hist_df.bin_loc.diff().mean()

plt.figure(figsize=(10, 7))
# bin_loc marks the left edge of each bin
plt.bar(py_hist_df.bin_loc, py_hist_df.bin_height,
        width=bin_width, align='edge', edgecolor='black')
plt.title('matplotlib Histogram (Python)', size=26)
plt.xlabel('x-axis', size=22)
plt.ylabel('Frequency', size=22)
plt.xticks(size=14)
plt.yticks(size=14)
plt.tight_layout()
plt.show()

Figure 1: MPP Histogram in Python

Plotting MPP Histograms in R

We can follow this same procedure in R by using the RPostgreSQL library to bring data from GPDB or HAWQ locally and using ggplot2 or R’s default plotting library to plot the histograms.

library(RPostgreSQL)

sql <- "
SELECT *
  FROM histogram_values
 ORDER BY bin_loc;
"
r_hist_df <- dbGetQuery(conn, sql)

Again, conn is a DBIConnection object in R that is pointing to GPDB or HAWQ. This pulls in data in the same manner as before. Now, we can plot our histogram using ggplot.

library(ggplot2)

# Since all bin widths are the same, we can define
# them by the distance between the first two points
plot_bin_width <- r_hist_df$bin_loc[2] - r_hist_df$bin_loc[1]

ggplot(r_hist_df, aes(bin_loc, weight = bin_height)) +
  geom_histogram(binwidth = plot_bin_width, col = 'black', fill = 'dodgerblue2') +
  labs(title = 'ggplot Histogram (R)', x = 'x-axis', y = 'Frequency') +
  theme(plot.title = element_text(size = 26, hjust = 0.5),
        axis.title.x = element_text(size = 22),
        axis.title.y = element_text(size = 22),
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14)
       )

Figure 2: MPP Histogram in R

MPP Scatter Plot

Scatter plots differ from histograms in that they do not condense data; each point must be plotted individually. However, if the data set is very large, there will likely be many overlapping points, and plotting all of them is wasteful because many will be hidden under others. A more economical solution is to bin the data as we did for histograms, using the same technique but grouping by bin numbers in both the x and y directions.

The resulting table has three columns: the bin number in the x direction, the bin number in the y direction, and the frequency. We can visualize this by plotting the bin locations with partial transparency: areas of lower density are more transparent and areas of higher density more opaque. We can use a similar query as before, but with a second bin column to account for the added dimension, as sketched below.
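A minimal sketch of such a query is shown below, following the same template conventions as the histogram query: x_column_name, y_column_name, table_name, and nbins are placeholders to replace, the same number of bins is used in both directions for simplicity, and the DISTRIBUTED clause should again be omitted on PostgreSQL.

CREATE TABLE scatter_plot_values
     AS WITH min_max_table AS
             (SELECT MIN(x_column_name) AS min_x,
                     MAX(x_column_name) AS max_x,
                     MIN(y_column_name) AS min_y,
                     MAX(y_column_name) AS max_y
                FROM table_name
             ),
             binned_table AS
             (SELECT CASE WHEN x_column_name < max_x
                               THEN FLOOR((x_column_name - min_x)::NUMERIC
                                          /(max_x - min_x)
                                          * nbins
                                         )
                                    /nbins * (max_x - min_x)
                                    + min_x
                          ELSE (nbins - 1)::NUMERIC/nbins * (max_x - min_x)
                               + min_x
                           END AS scat_bin_x,
                     CASE WHEN y_column_name < max_y
                               THEN FLOOR((y_column_name - min_y)::NUMERIC
                                          /(max_y - min_y)
                                          * nbins
                                         )
                                    /nbins * (max_y - min_y)
                                    + min_y
                          ELSE (nbins - 1)::NUMERIC/nbins * (max_y - min_y)
                               + min_y
                           END AS scat_bin_y
                FROM table_name
                     CROSS JOIN min_max_table
               WHERE x_column_name IS NOT NULL
                 AND y_column_name IS NOT NULL
             )
      SELECT scat_bin_x, scat_bin_y, COUNT(*) AS freq
        FROM binned_table
       GROUP BY scat_bin_x, scat_bin_y
       ORDER BY scat_bin_x, scat_bin_y
  DISTRIBUTED BY (scat_bin_x, scat_bin_y);

Since NULLs are filtered out in the WHERE clause, the ELSE branch only needs to handle values equal to the maximum, folding them down a bin exactly as the histogram query does.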

Plotting MPP Scatter Plots in Python

We pull our condensed data into Python as before.

sql = '''
SELECT *
  FROM scatter_plot_values
 ORDER BY scat_bin_x, scat_bin_y;
'''
py_scat_df = psql.read_sql(sql, conn)

Then we can make our plot using matplotlib’s scatter function.

import numpy as np

# Manually specify color with opacity proportional to frequency
col = np.zeros((py_scat_df.shape[0], 4))
# Set the RGB channels to blue
col[:, :3] = (0.29, 0.44, 0.69)
# Set the alpha channel in proportion to each bin's frequency
col[:, 3] = py_scat_df.freq/py_scat_df.freq.max()

plt.figure(figsize=(10, 7))
plt.scatter(py_scat_df.scat_bin_x, py_scat_df.scat_bin_y, c=col, lw=0)
plt.title('matplotlib Scatter Plot (Python)', size=26)
plt.xlabel('x variable', size=22)
plt.ylabel('y variable', size=22)
plt.xticks(size=14)
plt.yticks(size=14)
plt.tight_layout()
plt.show()

Figure 3: MPP Scatter Plot in Python

Plotting MPP Scatter Plots in R

The steps to do this in R are analogous.

sql <- "
SELECT *
  FROM scatter_plot_values
 ORDER BY scat_bin_x, scat_bin_y;
"
r_scat_df <- dbGetQuery(conn, sql)

We can also use ggplot to create a similar plot.

ggplot(r_scat_df, aes(scat_bin_x, scat_bin_y, alpha = freq)) +
  geom_point(col = 'dodgerblue2') +
  labs(title = 'ggplot Scatter Plot (R)', x = 'x variable', y = 'y variable') +
  theme(plot.title = element_text(size = 26, hjust = 0.5),
        axis.title.x = element_text(size = 22),
        axis.title.y = element_text(size = 22),
        axis.text.x = element_text(size = 14), 
        axis.text.y = element_text(size = 14),
        legend.title = element_text(size = 18),
        legend.text = element_text(size = 16)
       )

Figure 4: MPP Scatter Plot in R

ROC Curve

An ROC curve is a plot we can use to measure the performance of a binary classifier. It is created by computing the true and false positive rates at every possible threshold value. This computation may not be possible locally if the data is too large. Suppose we have a table named model_scores that contains the true labels, y_true (coded as 0 or 1), and the predicted probabilities, y_score. We can then compute the ROC curve values in-database using the following code (again, omit the DISTRIBUTED clause if using PostgreSQL):

CREATE TABLE roc_curve_values
     AS WITH pre_roc AS
             (SELECT *,
                     -- Running count of actual positives (true positives)
                     -- among observations scored at or above this one
                     SUM(y_true)
                         OVER (ORDER BY y_score DESC) AS num_pos,
                     -- Running count of actual negatives (false positives)
                     SUM(1 - y_true)
                         OVER (ORDER BY y_score DESC) AS num_neg
                FROM model_scores
             ),
             class_sizes AS
             (SELECT SUM(y_true) AS tot_pos,
                     SUM(1 - y_true) AS tot_neg
                FROM model_scores
             )
      SELECT DISTINCT
             y_score AS thresholds,
             num_pos/tot_pos::NUMERIC AS tpr,
             num_neg/tot_neg::NUMERIC AS fpr
        FROM pre_roc
             CROSS JOIN class_sizes
 DISTRIBUTED BY (thresholds);

We achieve this by using window functions to sort the observations by their score and take running sums of y_true and 1 - y_true. This produces two columns, num_pos and num_neg, which count the actual positives and actual negatives among the observations scored at or above a given y_score; these are the true positives and false positives when the threshold equals that y_score. Dividing them by the total number of positives and negatives, respectively, gives the true and false positive rates.

Bringing this table locally and sorting by threshold, we can form the ROC curve by plotting these two rates against each other, with fpr on the x-axis and tpr on the y-axis.
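As with the earlier plots, the pull is a simple ordered query on the summary table, which can be wrapped in psql.read_sql or dbGetQuery exactly as before:

SELECT thresholds, fpr, tpr
  FROM roc_curve_values
 ORDER BY thresholds DESC;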

Figure 5: MPP ROC Curve

AUC Score

The ROC curve is a diagnostic check on the performance of a binary classifier, but it is hard to use directly to compare different classifiers. For that, we can use the AUC (area under the curve) statistic, which is precisely the area underneath the ROC curve. We can compute it with the trapezoidal rule, summing the areas of the trapezoids formed by successive points along the curve: each trapezoid has width fpr_i - fpr_(i-1) and average height (tpr_i + tpr_(i-1))/2, which correspond to the width and avg_height columns in the query below.

  WITH roc_trapezoids AS
       (SELECT *,
               -- Average of the current and previous tpr:
               -- the trapezoid's mean height
               AVG(tpr)
                   OVER (ORDER BY thresholds DESC
                          ROWS BETWEEN 1 PRECEDING
                           AND CURRENT ROW
                         ) AS avg_height,
               -- Step in fpr from the previous point: the trapezoid's
               -- width (NULL for the first row, which SUM then ignores)
               fpr - LAG(fpr, 1)
                   OVER (ORDER BY thresholds DESC)
                   AS width
          FROM roc_curve_values
       )
SELECT SUM(avg_height * width) AS auc_score
  FROM roc_trapezoids;

As of the most recent version of MADlib (v1.9.1), there are built-in functions to compute the ROC curve and AUC score. However, older versions of MADlib and plain PostgreSQL do not provide such functionality.

Next Steps

In this blog, we have outlined a basic set of steps to generate histograms and scatter plots in an MPP database. These are useful for exploring univariate and bivariate data before modeling and for visualizing results afterwards. We have also shown how to compute an ROC curve in-database, an important post-modeling diagnostic. This procedure of doing the heavy computation in-database and then plotting locally extends to many other types of plots. For reusable code that defines functions to perform these plots, please refer to https://github.com/gregtam/mpp-plotting.

About the Author

Greg Tam

Greg Tam is a Data Scientist for Pivotal, where he helps customers dig deep to understand their data. Prior to joining Pivotal, he was at Harvard University, where he completed his master’s degree in statistics. He also has a bachelor’s degree in probability and statistics from the math department at McGill University.
