One Query To Rule Them All: Demonstrating Integrated Analytics in Greenplum

June 25, 2018 Pavan Kumar Nagula

One of the top advantages of the Greenplum Database is the ability to run integrated analytics in-database. Many Greenplum users have been able to:

  • Run all their advanced analytical processing (Text, Machine Learning, GeoSpatial, Graph and others) in-database
  • Run all their analytical queries in parallel and in a distributed way
  • Provide seamless access to disparate data repositories, like Hadoop or public cloud data lakes

Working on a company project I recently wrote this very interesting integrated analytics SQL query that I consider is very rich but also simple to write and solves a very complex problem:

“Return a list of people who ‘works at Pivotal’ and know each other ‘directly’ and whose names sounds like ‘Peter’ and ‘Pavan’ withdrawn an amount > $200 within 24 hours at an ATM less than 2 KM from reference latitude and longitude.”

For demonstration purposes, consider the following tables:

People: it holds information about people.

Id Integer Unique id
Lastname Varchar Last name of a person
Firstname Varchar First name of a person
Description Varchar Description of text where he works and about himself

Transactions: it holds information about transactions done by people at different locations and time.

TransId Integer Unique id for each transaction
Id Integer Unique id
Tran_Date Timestamp Transaction date and time
Amount Float Amount withdrew
Locid Integer Id of ATM location where money has been withdrawn

Location: it holds address, latitude and longitude information of ATMs.

Locid Integer Unique id for each location
Address Varchar Address of ATM
Lat Float Latitude of ATM location
Lng Float Longitude of ATM location

Links: it holds links or connections between people, they may be directly or indirectly connected.

Src Integer Source node
Dest Integer Destination node

And finally, the Results table holds our results set.

Id Integer Unique ID of each person
Firstname Varchar First name of a person
Lastname Varchar Last name of a person
Amount Float Amount Withdrawn
Tran_Date Timestamp Transaction date and time
Lat Float Latitude of ATM location
Lng Float Longitude of ATM location
Address Varchar Address of ATM
Description Varchar Description

With this understanding of the tables being involved, let me explain step by step all the processing and tools utilized on this Greenplum integrated query:

“works at Pivotal ”

  • This is a text search with GPText: Greenplum’s GPText provides text indexing and searching. GPText is based on Apache Solr technology.

“ People whose name sounds like ‘Peter’ and ‘Pavan’ ”

  • This is a fuzzy string matching: The Greenplum Database Fuzzy String Match extension provides functions to determine similarities and distance between strings based on various algorithms.

“ People who have direct contact with each other ”

  • This is done with the Apache MADlib ML and analytics library running on Greenplum: Apache MADlib’s Breadth-First Search (BFS) finds all nodes reachable from the source vertex. BFS search helps to find indirect contacts also at multiple levels, for direct contact just an entry search in links table is enough but for indirect contact we have to do graph search.

“ ATM locations should be less than 2 KM from a reference latitude and longitude “

  • This is done with Greenplum PostGIS Extension: Greenplum’s PostGIS extension calculates distance between locations. This PostGIS interface helps Greenplum to do spatial processing in-database.

 

The following single query does it all!

Let us look at the query closer and what it does.

 

SELECT distinct

a.id,

a.firstname,

a.lastname,

amount,

tran_date,

c.lat,

c.lng,

address,

a.description,

d.score

FROM people a, transactions b, location c,

(SELECT w.id, q.score FROM people w, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gpadmin.public.people' , 'works at Pivotal', null) q

WHERE (q.id::integer) = w.id order by 2 desc) d

WHERE soundex(firstname)=soundex($1)

AND a.id=b.id

AND amount > $3

AND (extract (epoch from tran_date) - extract(epoch from now()))/3600 < $4

AND st_distance_sphere(st_makepoint($5, $6),st_makepoint(c.lng, c.lat))/1000.0 <= 2.0

AND b.locid = c.locid

AND a.id=d.id

 

  • Gptext_search() function searches for string ‘work at Pivotal’ in description column indexed by GPText. It enables organizations to process mass quantities of raw text data for large-scale text analytics
  • SoundEX() function converts a string to soundex code. The Soundex system is a method of matching similar-sounding (similar phonemes) names by converting them to the same code.
  • St_makepoint() and St_distance_sphere(), PostGIS functions makes latitude and longitudes as geometrical points and calculates distance between points in meters respectively. The Greenplum Database PostGIS extension includes support for GiST-based R-Tree spatial indexes and functions for analysis and processing of GIS objects. The Greenplum Database PostGIS extension supports the optional PostGIS raster data type and most PostGIS Raster functions.

 

Let’s see how direct contact check happens in the query using Greenplum MADlib.

 

SELECT MADlib.graph_bfs('people','id','links',NULL,'1','out');

Madlib.graph_bfs function takes parameters 1) Vertex table 2) Vertex id column in the Vertex table 3) Edge table containing edge information 4) Source and destination vertex id column in edge table 5) Source vertex id for the algorithm to start 6) Name of the table to store the result of BFS

Once the above query executes the MADlib graph_bfs function outputs vertex id, distance and parent node information.

 

SELECT 1 into linkchk from out where dist=1 and id=v2.id;

 

The above query checks to see if there is direct link or contact between two people by checking distance = 1 (direct contact), and if we want to find out in-direct contact at a distance of two connections then we will check dist = 2(in-direct contact) .

 

Finally, when we execute the function Get_People the function runs the query and inserts the results into results table which we finally query.

 

SELECT distinct * FROM results;

 

About the Author

Pavan Kumar Nagula

Pavan Nagula has around 15 years of diversified IT experience (Programmer, DBA, Certified Product Owner and Data Scientist) in which last 6 years into data science. All the years he has been into coding, database administration (Designing logical database models, writing complex SQL), product management (developing product roadmaps, release plans, sprint plans and execution), machine learning engineer/data scientist (scoping, response to RFP, data adequacy, data exploration, feature engineering, model building, assessment and fine tuning, deep learning methods). Overall his experience covers depth and breadth required to be a data science leader.

Follow on Linkedin
Previous
Large Chinese Bank’s Successful TD to GP Migration
Large Chinese Bank’s Successful TD to GP Migration

Today I used Google Translate to read this phenomenal success case of TD to GP by one of the world’s larges...

No More Articles