One Query To Rule Them All: Demonstrating Integrated Analytics in Greenplum

June 25, 2018 Pavan Kumar Nagula

One of the top advantages of the Greenplum Database is the ability to run integrated analytics in-database. Many Greenplum users have been able to:

Run all their advanced analytical processing (Text, Machine Learning, GeoSpatial, Graph and others) in-database
Run all their analytical queries in parallel and in a distributed way
Provide seamless access to disparate data repositories, like Hadoop or public cloud data lakes

Working on a company project I recently wrote this very interesting integrated analytics SQL query that I consider is very rich but also simple to write and solves a very complex problem:

“Return a list of people who ‘works at Pivotal’ and know each other ‘directly’ and whose names sounds like ‘Peter’ and ‘Pavan’ withdrawn an amount > $200 within 24 hours at an ATM less than 2 KM from reference latitude and longitude.”

For demonstration purposes, consider the following tables:

People: it holds information about people.

Id	Integer	Unique id
Lastname	Varchar	Last name of a person
Firstname	Varchar	First name of a person
Description	Varchar	Description of text where he works and about himself

Transactions: it holds information about transactions done by people at different locations and time.

TransId	Integer	Unique id for each transaction
Id	Integer	Unique id
Tran_Date	Timestamp	Transaction date and time
Amount	Float	Amount withdrew
Locid	Integer	Id of ATM location where money has been withdrawn

Location: it holds address, latitude and longitude information of ATMs.

Locid	Integer	Unique id for each location
Address	Varchar	Address of ATM
Lat	Float	Latitude of ATM location
Lng	Float	Longitude of ATM location

Links: it holds links or connections between people, they may be directly or indirectly connected.

Src	Integer	Source node
Dest	Integer	Destination node

And finally, the Results table holds our results set.

Id	Integer	Unique ID of each person
Firstname	Varchar	First name of a person
Lastname	Varchar	Last name of a person
Amount	Float	Amount Withdrawn
Tran_Date	Timestamp	Transaction date and time
Lat	Float	Latitude of ATM location
Lng	Float	Longitude of ATM location
Address	Varchar	Address of ATM
Description	Varchar	Description

With this understanding of the tables being involved, let me explain step by step all the processing and tools utilized on this Greenplum integrated query:

“works at Pivotal ”

This is a text search with GPText: Greenplum’s GPText provides text indexing and searching. GPText is based on Apache Solr technology.

“ People whose name sounds like ‘Peter’ and ‘Pavan’ ”

This is a fuzzy string matching: The Greenplum Database Fuzzy String Match extension provides functions to determine similarities and distance between strings based on various algorithms.

“ People who have direct contact with each other ”

This is done with the Apache MADlib ML and analytics library running on Greenplum: Apache MADlib’s Breadth-First Search (BFS) finds all nodes reachable from the source vertex. BFS search helps to find indirect contacts also at multiple levels, for direct contact just an entry search in links table is enough but for indirect contact we have to do graph search.

“ ATM locations should be less than 2 KM from a reference latitude and longitude “

This is done with Greenplum PostGIS Extension: Greenplum’s PostGIS extension calculates distance between locations. This PostGIS interface helps Greenplum to do spatial processing in-database.

The following single query does it all!

Let us look at the query closer and what it does.

SELECT distinct

a.id,

a.firstname,

a.lastname,

amount,

tran_date,

c.lat,

c.lng,

address,

a.description,

d.score

FROM people a, transactions b, location c,

(SELECT w.id, q.score FROM people w, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gpadmin.public.people' , 'works at Pivotal', null) q

WHERE (q.id::integer) = w.id order by 2 desc) d

WHERE soundex(firstname)=soundex($1)

AND a.id=b.id

AND amount > $3

AND (extract (epoch from tran_date) - extract(epoch from now()))/3600 < $4

AND st_distance_sphere(st_makepoint($5, $6),st_makepoint(c.lng, c.lat))/1000.0 <= 2.0

AND b.locid = c.locid

AND a.id=d.id

Gptext_search() function searches for string ‘work at Pivotal’ in description column indexed by GPText. It enables organizations to process mass quantities of raw text data for large-scale text analytics
SoundEX() function converts a string to soundex code. The Soundex system is a method of matching similar-sounding (similar phonemes) names by converting them to the same code.
St_makepoint() and St_distance_sphere(), PostGIS functions makes latitude and longitudes as geometrical points and calculates distance between points in meters respectively. The Greenplum Database PostGIS extension includes support for GiST-based R-Tree spatial indexes and functions for analysis and processing of GIS objects. The Greenplum Database PostGIS extension supports the optional PostGIS raster data type and most PostGIS Raster functions.

Let’s see how direct contact check happens in the query using Greenplum MADlib.

SELECT MADlib.graph_bfs('people','id','links',NULL,'1','out');

Madlib.graph_bfs function takes parameters 1) Vertex table 2) Vertex id column in the Vertex table 3) Edge table containing edge information 4) Source and destination vertex id column in edge table 5) Source vertex id for the algorithm to start 6) Name of the table to store the result of BFS

Once the above query executes the MADlib graph_bfs function outputs vertex id, distance and parent node information.

SELECT 1 into linkchk from out where dist=1 and id=v2.id;

The above query checks to see if there is direct link or contact between two people by checking distance = 1 (direct contact), and if we want to find out in-direct contact at a distance of two connections then we will check dist = 2(in-direct contact) .

Finally, when we execute the function Get_People the function runs the query and inserts the results into results table which we finally query.

SELECT distinct * FROM results;

About the Author

Pavan Nagula has around 15 years of diversified IT experience (Programmer, DBA, Certified Product Owner and Data Scientist) in which last 6 years into data science. All the years he has been into coding, database administration (Designing logical database models, writing complex SQL), product management (developing product roadmaps, release plans, sprint plans and execution), machine learning engineer/data scientist (scoping, response to RFP, data adequacy, data exploration, feature engineering, model building, assessment and fine tuning, deep learning methods). Overall his experience covers depth and breadth required to be a data science leader.
Follow on Linkedin

Large Chinese Bank’s Successful TD to GP Migration

Today I used Google Translate to read this phenomenal success case of TD to GP by one of the world’s larges...

No More Articles

One Query To Rule Them All: Demonstrating Integrated Analytics in Greenplum

About the Author

Previous

Most Recent

Author | Nihal Jain Troubleshooting and identifying supportability issues in a complex database system can be a daunting task. However, with the advent of tools like gpsupport, the process has...

Authors | Kevin Yeap & Brent Doil Introduction Greenplum Upgrade (gpupgrade) is a utility that allows in-place upgrades from Greenplum Database (GPDB) 5.x version to 6.x version. Version 1.7.0...

Greenplum Database is a massively parallel processing (MPP) database designed for high-performance analytics and data warehousing. Similar to handling other MPP databases, it mandates routine...

Author | Wen Lin Introduction Greenplum is an open-source, massively parallel data warehouse designed for analytics and AI applications. Efficient data compression is vital in Greenplum to reduce...

Frequently tuned Greenplum parameters: Please find below the list of most commonly used Greenplum parameters. Tuning these parameters can assist with the efficient memory management, performance...

Author | David Kimura Greenplum Database is a massively parallel processing (MPP) database designed for handling large-scale data warehousing and analytics workloads. One of its key features is...

Author | Andrew Repp The release of Greenplum Database 7 (GPDB7) brings with it many new features, and each of them requires some thought to make sure we get the most out of them. Today, I want to...

Big Data and AI in healthcare offer precise decision-making, personalized treatments, and efficiency enhancements. Merging structured and unstructured data provides a holistic patient view, while...

Author | Joe Smith In recent years, the healthcare industry has witnessed a tremendous surge in data volume. This data originates from various sources, encompassing patient health records,...

ALTER TABLE in Greenplum 7: Avoiding Table Rewrite The ALTER TABLE commands are commonly used for operations like adding columns, changing the column data type, and many more. In many cases, such...

Author | Huasong Fu The ALTER TABLE commands are commonly used for operations like adding columns, changing the column data type, and many more. In many cases, such commands require the whole...

Authors: Alexandra Wang & Marbin Tan Progress reporting views in Greenplum 7 Greenplum 7 provides progress reporting for certain commands during their execution. The commands include ANALYZE,...

Authors | Alexandra Wang & Marbin Tan Greenplum 7 provides progress reporting for certain commands during their execution. The commands include ANALYZE, CLUSTER, CREATE INDEX, VACUUM, COPY and...

In the increasingly digitized and data-driven business world, understanding and harnessing the power of big data is not just advantageous – it’s essential. Big data analytics tools provide the...

Author | Joe Smith In the increasingly digitized and data-driven business world, understanding and harnessing the power of big data is not just advantageous – it’s essential. Big data analytics...

We recently released GreenplumPython, a Python library that allows users to interact with Greenplum or PostgreSQL in a Pythonic way. GreenplumPython provides a pandas-like table API that is...

Author | Soumyadeep Chakraborty Subtransaction overflow can really bring a cluster to it’s knees, if coupled with long running transactions. It manifests when any given backend creates more than...

Accelerating Data Processing with PL/Container and GPU: A Powerful Combination PL/Container is an extension of the Greenplum database that provides an easy way to run user-defined functions (UDFs)...

Authors: Ashwin Agrawal, Divya Bhargov, Kristine Scott Greenplum 7 brings in the STORED generated columns feature from Postgres 12. In this blog post, we’ll take a closer look at stored generated...

Introduction Greenplum Database utilizes Multiversion Concurrency Control (MVCC) to maintain data consistency and manage concurrent access to data. Transaction snapshots are used to control what...