Jeff Hammerbacher: "Hadoop Operations", Velocity 2009 Day One

June 23, 2009 Pivotal Labs

Jeff is Chief Scientist at Cloudera, which helps enterprises with Hadoop implementations.

Hadoop consists of three modules, which are apparently in the process of being split into separate Apache projects:

  • Hadoop Distributed File System (HDFS)
  • MapReduce (a word-count sketch follows this list)
  • Common (aka Hadoop Core)
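
To make the MapReduce piece concrete, the canonical word-count job is sketched below. This is a minimal sketch against the circa-2009 Hadoop 0.20 Java API, not code from the talk; class names and paths are my own.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in each input line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // same reducer doubles as combiner
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper emits (word, 1) pairs, and the summing reducer doubles as a combiner so partial sums happen map-side and cut shuffle traffic.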

I’ll just mention some of the interesting little tidbits from the presentation:

  • Standard box spec is 1U, 2×4-core CPUs, 8 GB RAM, 4×1 TB 7200 RPM SATA drives.

HDFS:

  • Stores data in 128 MB blocks and replicates each block (see the sketch after this list)
  • Good for large files written once and read many times
  • Throughput scales nearly linearly
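
For flavor, here is what "written once, read many times" looks like from the Java FileSystem API. A minimal sketch, not from the talk; the NameNode URI and file path are hypothetical, and dfs.block.size/dfs.replication are the era-appropriate property names for per-file block size and replication factor.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode URI; fs.default.name is the pre-2.x property name.
    conf.set("fs.default.name", "hdfs://namenode:8020");
    conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                  // copies of each block

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/example.txt"); // hypothetical path
    FSDataOutputStream out = fs.create(path);  // one writer, streamed to HDFS
    out.writeBytes("written once, read many times\n");
    out.close();
  }
}
```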

Some examples of Hadoop-based projects:

  • Avro – cross-language data serialization
  • HBase – a BigTable-like distributed table store
  • Hive – a SQL interface and an interesting open-source data warehouse solution
  • ZooKeeper – coordination service for distributed applications (a minimal client sketch follows this list)
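
Since ZooKeeper is on that list, below is a minimal sketch of its Java client doing the simplest useful coordination trick, an ephemeral-znode lock. The connect string and znode paths are hypothetical, not anything from the talk.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);

    // Hypothetical connect string; session times out after 3s of silence.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await(); // the constructor returns before the session is live

    // Make sure the parent znode exists (persistent, outlives sessions).
    if (zk.exists("/locks", false) == null) {
      zk.create("/locks", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    }

    // An ephemeral znode acts as a crude lock: only one client can create it.
    String lock = zk.create("/locks/my-job", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    System.out.println("Holding " + lock);

    zk.delete(lock, -1); // release explicitly (version -1 matches any)
    zk.close();
  }
}
```

The point of EPHEMERAL is that ZooKeeper deletes the znode when the owning session dies, so a crashed worker can never leave the lock held.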

Hadoop @ Yahoo: 16 clusters, each with 1,400 nodes and 2.5 PB of storage

Cloudera maintains convenient, stable Hadoop packages (all open source), so you don’t have to work out for yourself which versions of the subprojects play well together.

Testing: Hadoop has a standalone (local) mode, which runs the whole job, with a single reducer, in one JVM; a sketch follows.
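
A minimal sketch of what that looks like, reusing the WordCount classes from earlier. The property names (mapred.job.tracker, fs.default.name) are the pre-2.x ones, and the input/output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCountTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Run the job runner in-process against the local filesystem:
    // no cluster, one JVM, a single reducer.
    conf.set("mapred.job.tracker", "local");
    conf.set("fs.default.name", "file:///");

    Job job = new Job(conf, "wordcount-local-test");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCount.TokenMapper.class);
    job.setReducerClass(WordCount.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("test-input"));    // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("test-output")); // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```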

Jeff mentioned that they use Facebook’s Scribe for distributed logging.

And last but not least, Cloudera has a GetSatisfaction page.
