Hadoop World 2012 is just around the corner, kicking off next Tuesday, October 23, in New York City. This will be my third consecutive time at Hadoop World and it has been exciting to watch the ecosystem change and evolve over the past few years. The impact of merging the conference with Strata is very evident as you look through the schedule. The conference isn’t just about Hadoop anymore—it’s about using Hadoop for solving Big Data, data science, and other business problems. There are several interesting talks planned that have the potential of sharing knowledge and expertise between these groups. In this article, I’ll share some of the talks that I’ll be paying attention to.
First, I’ll make a few general comments on the schedule:
- I saw the term “Big Data” a lot, maybe more than Hadoop itself. This may be due to the merger with the more general-interest O’Reilly Strata conference, but in general I think this shows that people are talking less about the technology and more about the business cases. Maybe Hadoop is reaching the point of commoditization.
- The term “Data Science” is also making headway. I think now that Hadoop is easier to use and more stable, it is becoming a more reasonable platform on which to “do data science”. This is demonstrated not only by the number of talks discussing data science business cases, but also by the increased focus on methodology.
- Hadoop isn’t just about the technical guys anymore. I’ve seen this in action while working with Greenplum customers. Many of the talks are very business oriented, which I think is a good thing. As someone who is trying to learn about the bleeding-edge technical ideas in the Hadoop space, I probably won’t be attending many of these. Regardless, this demonstrates that Hadoop is the real deal: a basis for real business opportunities that convert to real dollars.
- HBase was a hot topic at Hadoop World two years ago. It solved a number of issues preventing a Hadoop cluster from becoming more generally useful to an organization. I think this year’s hot topic is going to be real-time and streaming analytics. Batch isn’t quite cutting it anymore and more organizations are getting new and fresh data into clusters every second, instead of analyzing large historical data sets. There aren’t too many talks on this topic, but I suspect there will be a lot of chatter in the hallways of the Hilton and Sheraton.
You more or less have to watch the keynotes, but I wanted to tell you what I’m really excited about.
Pay attention to “The End of the Data Warehouse” by Platfora’s Ben Werther. Platfora has been pretty quiet about their upcoming product so far, so expect some interesting revelations at this keynote. I’ve been able to see some initial demos and they are very impressive. Platfora is going to play an important role as Hadoop evolves. Ben and company at Platfora will also present in the Hadoop: Tools & Technology track, with a sure-to-be-interesting talk, “Turning More Than 2PB of Raw Data into Interactive BI”.
Next is a talk by Michael Flowers, a New York City official, who has been analyzing data about the city’s residents for the greater good. Talks like this usually have some fun facts thrown out about the data, which is always interesting. However, the reason why I’m really interested in this is to understand in what ways government can be successful with Big Data and what challenges are holding government back. Government institutions have to navigate many unique privacy and budgetary concerns, so I’m excited to hear how NYC got around these issues.
I’m obviously biased, but I’m also excited to hear Greenplum’s Annika Jimenez talk about the business and process components of making Big Data a success. In the description she states “Data Science is a team sport,” an extremely important concept we’ve learned by conducting numerous engagements with our data science team all over the world.
Dremel and Drill. Two talks in particular should not be missed: “Big Data for the Masses: How we Opened Up the Doors to Google’s Dremel” and “Drill into Big Data”, which cover Dremel and Drill. I personally don’t know much about these technologies yet, but I know enough to know that I should be paying attention. These could revolutionize the way we analyze data in a Hadoop cluster and really take it to the next level.
Video. The talk “DCGS-Army Standard Cloud Multimedia” is the only talk that mentions video data. Video (and even images) has traditionally been challenging for Hadoop to analyze out of the box for a number of technical reasons, so I’m curious to hear how the Army has gotten around these issues. Greenplum’s data science team is currently working on a couple of engagements dealing with video data in Hadoop and I expect them to be doing many more in the next year. Once the community breaks through and there is more innovation around how to tackle this medium, we’ll see an explosion in video analysis.
Facebook. Most of my notes in this article are on specific technologies that I think are interesting, but you should always pay attention to Facebook’s talks. Facebook is always very transparent in how they solve their problems and I personally have learned more than a few lessons from these guys. Facebook is arguably the largest center of innovation around HBase operations and usage, given their large HBase footprint and ludicrous amount of data. Their first talk “Facebook’s Large Scale Monitoring System Built on HBase” is sure to have a number of tidbits of useful advice about deploying and using HBase.
Next from Facebook is “Taming the Object Graph”. Graph-oriented data sets have been gaining popularity recently, and for good reason. Graphs are a very natural and effective way to analyze data, but unfortunately they are also technically challenging to scale and utilize efficiently. Facebook might have the largest graph in existence, so I’m really excited to see their innovations in this space.
Networking and Hardware. I’m looking forward to “Designing Scalable Network Architectures for Fast Moving Big Data” simply because the network design of these large-scale clusters is one of the most important yet overlooked parts of using Hadoop. It is a complex problem and there is plenty of room for improvement in this realm.
“Is Your Cluster a Leaning Tower of Pisa?” caught my eye not only because of the catchy title, but also because many of the clusters I’ve worked with could be likened to the Tower of Pisa. Hopefully they dive into networking a bit, but I’m also excited to hear about different hardware configurations and how they impact performance.
A couple of years ago, Hadoop typically ran on commodity hardware, such as web servers, that was designed to serve other functions. Today, hardware vendors are tailoring hardware and networking configurations specifically to Hadoop. I can’t emphasize enough how groundbreaking this is: Hadoop is big enough now that hardware is being custom developed around Hadoop, rather than the other way around. It’s increasingly important for us all to pay close attention to the innovations in this realm.
Storm. “Realtime Processing with Storm” is something I’m really looking forward to, because we all need to be paying attention to Storm. I frequent the DC Hadoop Meetup, and there’s been a lot of talk about Storm recently. I know of a couple of organizations that are already using it in production.
Storm is a system for doing streaming analytics and streaming data processing, which fills a serious gap in the traditionally batch-oriented MapReduce model. HBase filled the ad-hoc query and some of the real-time holes that were in Hadoop, and now Storm complements MapReduce and HBase by providing streaming capabilities. I’m not sure there will be too many holes left after Storm.
Not everyone is into the vendor booth scene at conferences, but for others, it’s the best part! I really like to see what vendors are doing in this space, as they echo what they are seeing across businesses around the world. New products are exciting and expand the opportunities of what we can do.
I don’t know what the other companies will be talking about at their booths since everyone likes to stay quiet until the last minute, but I can tell you that we will definitely have some cool stuff going on at the Greenplum booth (300). On Thursday, October 25 at 10:00 am, I’ll be doing a book signing for my book “MapReduce Design Patterns” at our booth. Stop by, pick up a free copy of the book, and say hi!
You should also stop by the MapR booth. Those guys are willing to take some big risks to do cool stuff, and they have something very special in store. Traditionally, they have something new and interesting to check out at every conference!
I’ve mentioned this a few times already, but be on the lookout for Platfora! I think a lot of people are going to be impressed by what they are revealing.
The Cloudera and Hortonworks booths may not seem interesting to some at face value, since many Hadoop-savvy people know what these guys are doing already. However, the booths are typically staffed by very sharp folks who like talking about Hadoop and drilling deep into tough Hadoop problems.
Hope to see you at Hadoop World! I’m looking forward to a good time.
About the Author
Donald Miner