VLDB 2015—Pivotal’s Chief Scientist Recaps Keynotes and Papers

September 23, 2015 Jignesh Patel

It was a memorable 41st Annual International Conference on Very Large Data Bases (VLDB). VLDB is the oldest database conference, tied with the other venerable database conference, SIGMOD, as both date to 1975.

Yes—the field of databases, as an independent community, started long ago!

Back then, we had a Turing Award winner from the community every decade. In case you are counting, it was Charles Bachman in 1973 (for IDS), Ted Codd in 1981 (for the relational model), and Jim Gray in 1998 (for transaction processing). The Turing committee must have missed a beat in the last decade, but they more than made up for it by awarding Mike Stonebraker the Turing Award this year. In fact, the key highlight of VLDB this year was Mike delivering his Turing talk, which you can find here.

For any computer scientist, and especially those interested in how to take ideas from research to practice, Mike’s Turing talk is highly recommended. It is a beautifully woven story of an impressive bike ride across the United States and Mike’s (numerous) adventures with startups. If you listen closely, you will also find a reference to Pivotal Greenplum Database (GPDB). Mike invented Postgres, and both GPDB and HAWQ are based on it.

Mike’s talk really stole the thunder at VLDB, but there were other impressive talks by folks from Pivotal as well. Foremost amongst these was a talk delivered by Amr El-Helw on dealing with common table expressions (CTEs). The talk is based on the paper “Optimization of Common Table Expressions in MPP Database Systems”, which you can find here; Venkatesh Raghavan, Mohamed A. Soliman, George Caragea, Zhongxian Gu, and Michalis Petropoulos are co-conspirators with Amr on this work.

At their heart, CTEs provide a way to name a SQL expression and use it by reference in subsequent SQL expressions. CTEs show up often in enterprise-grade SQL, where complex queries are the norm, and they are an important abstraction for tools that work on top of the database engine. Dealing with CTEs is challenging: the optimal plan for each use of a CTE depends on its actual context, and simply inlining the CTE into the original query often leads to suboptimal query plans. There are other challenges too. When dealing with CTEs, you have to make sure that the data flow, from the execution of the CTE to the rest of the query, does not result in the actors (i.e., processes) deadlocking the system. If this all sounds complicated, it is! The beauty of this paper is that it proposes an elegant set of mechanisms to address CTEs. All of this is built into Orca, which is part of both Pivotal Greenplum and HAWQ, and both are in the process of being open-sourced. As the paper shows, the approach yields roughly a 2X improvement on the TPC-DS benchmark. Quite impressive!
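To make the idea concrete, here is a minimal sketch (illustrative data and query, not from the paper) of a CTE defined once and referenced twice, using Python's built-in sqlite3. An optimizer must decide, per reference, whether to inline the CTE or compute it once and share the result; getting that decision wrong is exactly the kind of suboptimality the paper addresses.

```python
# Sketch: one CTE ('region_totals'), two references to it in the same query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 100), ('east', 200), ('west', 50);
""")

# 'region_totals' is named once in the WITH clause and used twice below:
# once as the FROM table and once inside the scalar subquery.
rows = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    )
    SELECT rt.region, rt.total
    FROM region_totals rt
    WHERE rt.total > (SELECT AVG(total) FROM region_totals)
""").fetchall()

print(rows)  # → [('east', 300)]
```

In an MPP setting the stakes are higher than in this toy: sharing the CTE result means wiring one producer to multiple consumers across processes, which is where the deadlock-avoidance concerns mentioned above come in.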

There were also a host of other papers and keynotes associated with folks from Pivotal at VLDB, including two keynotes that I delivered. The first, at the In-Memory Data Management and Analytics Workshop, made an initial proposal for how hardware and software could work together to make data analytics systems more efficient and green (save the planet!). The second, at the TPCTC conference, proposed a dramatic rethink of how we create benchmarks. If you are interested in that talk, please come to the meetup.

Amongst the other papers at VLDB, we had two papers on topics related to making machine learning work better with data platforms. The first paper discussed how R makes poor use of modern hardware and points to some things that we could do to fix these issues. My collaborators, Prof. Somesh Jha and two students at Wisconsin, are working on fixing the problems that we identified in this paper. So, stay tuned. The second paper discussed how to take a simple type of relational learning, called Inductive Logic Programming, and map it to relational algebra. With it, we can run this type of learning method in a relational database engine (which really is a relational algebraic expression evaluator) and allow this class of algorithms to scale to data sizes that have historically been considered prohibitive. If you are interested in this topic, then come to the meetup.

