Gavin Sherry has played a major role in developing a number of well-known data-centric products. His journey started with significant contributions to PostgreSQL. Before big data was common, he joined Greenplum to help build the massively parallel processing engine for analyzing petabytes of data. At Pivotal, he helped lead the development of HAWQ, creating a new era of data platforms with SQL on Hadoop.
As Vice President of Engineering for data products, Gavin took the time to do a Q&A with us and shares the story of how he landed at Pivotal, where the market is heading, and a bit about how his products are being used to help companies become more data driven. Beyond the code and product leadership Gavin has contributed throughout his career, he is also a member of IANTD, the International Association of Nitrox and Technical Divers, and he is always looking for really good engineers!
Could you tell us about how you grew up and got into software?
Yes. I was born in Sydney, Australia and had a fairly usual upbringing for an Australian, spending time on the beach, playing cricket, and surfing. When I was 11 years old, I taught myself GW-BASIC. There was a manual on it that came with the first computer I ever used. By 16, I had a part time job writing C code. I went on to study English literature, math and computer science at the University of Sydney.
How did you end up working with PostgreSQL, Greenplum, and Pivotal?
Well, I started working back in 2000 with PostgreSQL, before it became one of the most widely adopted open source databases in the world. Back in those days, I was a major contributor to the project. I spent time consulting and working with a bunch of different companies while I was seeking commercial opportunities for the database.
The turning point for me came around the time we released PostgreSQL 7.3, in 2002. I was contacted by a very large retailer who wanted to shift off of expensive, proprietary legacy database systems, and they wanted to explore what was possible with open source databases. Their needs validated the importance of open source in a big way because they wanted to deploy about 4000 instances of PostgreSQL. They asked me to help with a variety of things—hardening, best practices, training, backend development and more. We rolled out a very sizable migration, and it was successful beyond my expectations. I saw the potential of open source software in the enterprise first hand and began focusing on more of this type of work.
In 2006, I was at a PostgreSQL conference, held to celebrate the 10th anniversary of Postgres going open source. The founders of Greenplum and I met, and they offered me an opportunity—we basically sealed the deal on the spot. I already knew about Greenplum. They needed a database kernel as a foundation to build their vision of a massively parallel analytics warehouse for big data. The challenge was exciting. It took a few million lines of code to get that vision to where the Greenplum Database is today. We took another year to get the product to a point where I’d have my paycheck generated out of it, as the saying goes.
Soon after that, we started to win some big deals—one early social media company, a major stock exchange, and a few others. Along the way, we also got to work with a lot of really smart people at the University of California, Berkeley, the University of Wisconsin-Madison, and Stanford. As you can imagine, I’m fairly nostalgic about those early days: the odds were against us, the company was centered around a small office in San Mateo, and it was great to see something so ambitious come to life through hard work.
From there, the journey got more exciting. Greenplum was acquired by EMC in 2010 and spun into Pivotal in 2012.
What have you been focused on at EMC and Pivotal?
One of the most powerful things we built was called MADlib. It is one of the broadest parallel machine learning libraries out there, is integrated with SQL, and is open source. We use it as part of our big data products today. For data scientists, it is a very robust toolset and works with PostgreSQL, Pivotal Greenplum Database, and Pivotal HAWQ.
About six months before Pivotal launched, we started working on HAWQ, our SQL engine for Hadoop. At the time, we just called it Greenplum for Hadoop. It was an innovative project that brought the power, scale and enterprise database features of a mature SQL database to Hadoop. We had to do some major surgery with the execution engine, the transaction management system, the storage layer and metadata management. We integrated parts of Apache Hadoop® and built new capabilities on top of Apache Hadoop®. This was a real career highlight for me because it solved a big impediment to Apache Hadoop® adoption in a way that customers got instantly.
I’m really excited about some of the things we’re working on in R&D relative to HAWQ and the broader set of technologies in the Pivotal Big Data Suite. I think we’ll have a material impact on the future of database technology next year.
Could you tell us about where things are headed based on your new role?
I recently took the role of VP of Engineering for Data at Pivotal. It’s pretty humbling to have such a responsibility—facilitating, supporting and growing a world class database team, driving deep innovation in the state of the art of database technology and ensuring that our products meet the demanding expectations of our significant customer base. It’s a truly amazing time to be working with data. I’m lucky because when you look at the Big Data Suite, with GPDB, HAWQ, Gemfire and Pivotal HD, you really have the platform to solve any and all data problems facing customers today. It sounds trite, but working with this team in this area at this point in time is really a joy. Some mornings I just dance into the office.
I’m also lucky to spend a fair amount of time with customers. Things are changing so much, I work hard to really listen to their challenges and think about what that means for Pivotal. What I hear is that they want to bring the sort of experience one associates with the infrastructure at the well known internet companies into their data centers, into their cloud environment. They see Big Data technology as a competitive advantage. They want to ask any question of the data and use it to drive innovation in their business. I believe that at Pivotal, we have the most powerful tools available today to do this.
I use that earlier example of the roll out of 4000 instances of PostgreSQL at that large retailer as a litmus test to see how far we’re moving the needle on technology. If the Big Data Suite had been available back then, I could have had a much more substantial impact on the business. We could have brought all of the data into a single, integrated system. We could have built true, data driven apps to help the retailer make more rapid, effective decisions, set better prices, improve customer loyalty, and more.
I know this is true because today we work with many retailers. With the Big Data Suite, we’re able to provide near real time information on product placement, store optimization and help connect customers with opportunities that are attractive to them. I couldn’t imagine doing that 15 years ago, certainly not with the kind of scale and responsiveness we take for granted today.
How do Cloud Foundry and other key open source capabilities fit with our data products?
The world is moving to the cloud. This is the future of enterprise computing. We have a significant roadmap oriented toward deep integration of the Big Data Suite with Cloud Foundry as part of our distribution, Pivotal Cloud Foundry. Bringing Big Data to the cloud is not straightforward. There are important challenges around preserving performance, reliability, and scale, and we have some of the brightest database and distributed systems engineers in the world working on these problems.
Pivotal is in a very special position, with product leadership in both big data and Platform-as-a-Service.
Today, Big Data Suite components are provided as data services in Pivotal CF. Developers can build apps to run against those data stores. We provide a seamless, self-serve experience, removing the need for operational heavy lifting. We started with Apache Hadoop® and HAWQ as part of Pivotal HD. There’s a natural progression from there: integration with BOSH for seamless provisioning and automation. The hardest problem is making data processing technologies cloud native. To do so, you need to do surgery on the architecture, and that requires deep, innovative thinking.
In addition to cloud, we do a lot with Apache Hadoop®. All of our data products rely upon or integrate heavily with it. We are collaborating with Hortonworks on Apache® Ambari to make it the standard for provisioning, managing, and monitoring Apache Hadoop® clusters. We have been the number one corporate contributor to Tachyon, which has stood as one of the fastest growing projects in AMPLab’s history. The Tachyon distributed memory system sits above HDFS, and we will be announcing some new and very compelling innovations here in 2015.
What are the other things that are top of mind for people working with Pivotal data technologies?
Mobile is a killer use case for big data technologies, and the Big Data Suite is an ideal platform for supporting mobile apps because of its scale: you can do all the capture and analysis without worrying about limitations on latency, throughput, or scale. All the data can land in one place within HDFS via Gemfire, with analysis via HAWQ and data serving and operational BI with Greenplum. We also see great use of Gemfire as a scoring and real time system within that serving layer.
Mobile continues to be a big priority for corporate investment, and Cloud Foundry now supports mobile backend as a service (MBaaS), tying into our data suite. The Pivotal CF Mobile Suite Gateway was released to help make mobile development simpler, more cost effective, and highly scalable. Since it runs in a company’s own data center with Pivotal CF and the Big Data Suite, companies overcome the security, compliance, control, and scale concerns associated with public cloud solutions. Importantly, the data all stays inside the company, making it accessible to integrate with other data or apps.
I’ll also say, we are aggressively hiring because of our growth. We are looking for people who want to complement a world-class team of parallel database and Apache Hadoop® engineers. We have a strong focus in the San Francisco Bay Area and Portland as well as China and India. There is already plenty of proof that we have industry leading minds, and we will continue to advance in the areas of query processing, SQL, machine learning, data replication, in memory, resource management, Apache Hadoop®, and more.
Thank you for speaking with us. Before we close, could you tell us about any personal passions—things you like to do in your personal time when you aren’t living and breathing Pivotal products?
Well, I have a lovely wife and an adorable puppy that seems to be a mix of Shetland and Dachshund. I’m also a bit of an adrenaline junkie and like to dive, ski, parachute, fly and much more. If I could do only one adventure sport for the rest of my life, it would be cave diving. Some of the best dives I’ve done have been in the Yucatan Peninsula in Mexico, around Playa del Carmen. The best would have to be a place called Cenote Minotauro, or perhaps a 391-foot-deep cenote called “The Pit”.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author: Stacey Schneider