One of the top advantages of the Greenplum Database is the ability to run integrated analytics in-database. Many Greenplum users have been able to:
- Run all their advanced analytical processing (Text, Machine Learning, GeoSpatial, Graph and others) in-database
- Run all their analytical queries in parallel and in a distributed way
- Provide seamless access to disparate data repositories, like Hadoop or public cloud data lakes
Working on a company project I recently wrote this very interesting integrated analytics SQL query that I consider is very rich but also simple to write and solves a very complex problem:
“Return a list of people who ‘works at Pivotal’ and know each other ‘directly’ and whose names sounds like ‘Peter’ and ‘Pavan’ withdrawn an amount > $200 within 24 hours at an ATM less than 2 KM from reference latitude and longitude.”
For demonstration purposes, consider the following tables:
People: it holds information about people.
Id | Integer | Unique id |
Lastname | Varchar | Last name of a person |
Firstname | Varchar | First name of a person |
Description | Varchar | Description of text where he works and about himself |
Transactions: it holds information about transactions done by people at different locations and time.
TransId | Integer | Unique id for each transaction |
Id | Integer | Unique id |
Tran_Date | Timestamp | Transaction date and time |
Amount | Float | Amount withdrew |
Locid | Integer | Id of ATM location where money has been withdrawn |
Location: it holds address, latitude and longitude information of ATMs.
Locid | Integer | Unique id for each location |
Address | Varchar | Address of ATM |
Lat | Float | Latitude of ATM location |
Lng | Float | Longitude of ATM location |
Links: it holds links or connections between people, they may be directly or indirectly connected.
Src | Integer | Source node |
Dest | Integer | Destination node |
And finally, the Results table holds our results set.
Id | Integer | Unique ID of each person |
Firstname | Varchar | First name of a person |
Lastname | Varchar | Last name of a person |
Amount | Float | Amount Withdrawn |
Tran_Date | Timestamp | Transaction date and time |
Lat | Float | Latitude of ATM location |
Lng | Float | Longitude of ATM location |
Address | Varchar | Address of ATM |
Description | Varchar | Description |
With this understanding of the tables being involved, let me explain step by step all the processing and tools utilized on this Greenplum integrated query:
“works at Pivotal ”
- This is a text search with GPText: Greenplum’s GPText provides text indexing and searching. GPText is based on Apache Solr technology.
“ People whose name sounds like ‘Peter’ and ‘Pavan’ ”
- This is a fuzzy string matching: The Greenplum Database Fuzzy String Match extension provides functions to determine similarities and distance between strings based on various algorithms.
“ People who have direct contact with each other ”
- This is done with the Apache MADlib ML and analytics library running on Greenplum: Apache MADlib’s Breadth-First Search (BFS) finds all nodes reachable from the source vertex. BFS search helps to find indirect contacts also at multiple levels, for direct contact just an entry search in links table is enough but for indirect contact we have to do graph search.
“ ATM locations should be less than 2 KM from a reference latitude and longitude “
- This is done with Greenplum PostGIS Extension: Greenplum’s PostGIS extension calculates distance between locations. This PostGIS interface helps Greenplum to do spatial processing in-database.
The following single query does it all!
Let us look at the query closer and what it does.
SELECT distinct
a.id,
a.firstname,
a.lastname,
amount,
tran_date,
c.lat,
c.lng,
address,
a.description,
d.score
FROM people a, transactions b, location c,
(SELECT w.id, q.score FROM people w, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gpadmin.public.people' , 'works at Pivotal', null) q
WHERE (q.id::integer) = w.id order by 2 desc) d
WHERE soundex(firstname)=soundex($1)
AND a.id=b.id
AND amount > $3
AND (extract (epoch from tran_date) - extract(epoch from now()))/3600 < $4
AND st_distance_sphere(st_makepoint($5, $6),st_makepoint(c.lng, c.lat))/1000.0 <= 2.0
AND b.locid = c.locid
AND a.id=d.id
- Gptext_search() function searches for string ‘work at Pivotal’ in description column indexed by GPText. It enables organizations to process mass quantities of raw text data for large-scale text analytics
- SoundEX() function converts a string to soundex code. The Soundex system is a method of matching similar-sounding (similar phonemes) names by converting them to the same code.
- St_makepoint() and St_distance_sphere(), PostGIS functions makes latitude and longitudes as geometrical points and calculates distance between points in meters respectively. The Greenplum Database PostGIS extension includes support for GiST-based R-Tree spatial indexes and functions for analysis and processing of GIS objects. The Greenplum Database PostGIS extension supports the optional PostGIS raster data type and most PostGIS Raster functions.
Let’s see how direct contact check happens in the query using Greenplum MADlib.
SELECT MADlib.graph_bfs('people','id','links',NULL,'1','out');
Madlib.graph_bfs function takes parameters 1) Vertex table 2) Vertex id column in the Vertex table 3) Edge table containing edge information 4) Source and destination vertex id column in edge table 5) Source vertex id for the algorithm to start 6) Name of the table to store the result of BFS
Once the above query executes the MADlib graph_bfs function outputs vertex id, distance and parent node information.
SELECT 1 into linkchk from out where dist=1 and id=v2.id;
The above query checks to see if there is direct link or contact between two people by checking distance = 1 (direct contact), and if we want to find out in-direct contact at a distance of two connections then we will check dist = 2(in-direct contact) .
Finally, when we execute the function Get_People the function runs the query and inserts the results into results table which we finally query.
SELECT distinct * FROM results;
About the Author
Follow on Linkedin