Why The Time is Right for MapReduce Design Patterns

December 3, 2012 Donald Miner


One of the common questions I get from people about my new book MapReduce Design Patterns is “why did you write it?” In this post, I’ll explain the reasons, as well as what MapReduce design patterns are, why they need to exist, and why the time is right.

Before getting into MapReduce design patterns, let’s talk about what a design pattern is. A design pattern in software engineering has the following properties:

General: the pattern strives to be domain independent
Reusable: the pattern is applicable to a number of different problems
Cannot be transformed directly into code: the pattern is a template for problem solving, not a solution
Follows best practices: there may be a number of ways to solve a problem, but likely the pattern is the best practice for that type of solution

Outside of MapReduce, design patterns provide a number of benefits to a community of software engineers. They:

Get the developer 80% of the way there and save some time. Knowing how to solve the problem is the majority of the battle; the developer just needs to tailor it to the domain-specific use case. The 80% rule seems to be a good middle ground. If you make the pattern any more specific, it won’t have general applicability. If it is less specific, it’s not really useful at all.
Pass knowledge from experts to beginners. New engineers can benefit from the lessons their predecessors learned. If experts spend the time to document a pattern, they can save themselves time in the future while supporting a broader audience.
Provide a common language for solutions. If problems and solutions are named, a community has a common language to discuss challenges and implementation. This enables more efficient and effective communication among members of a team.
Make the intent of code easier to understand. When you implement a pattern to solve one of your problems, it’ll be easier for other people who know the pattern to understand your code. While some software solutions are complex by necessity, when they follow a recognized template, it is far easier for others to understand them.

What is a MapReduce design pattern? Well, it’s all of the things above, but in the context of MapReduce. MapReduce is a rather constraining framework in which you have to express your solutions in terms of “map” and “reduce”. In return, you get the benefits of abstracted parallelism and fault tolerance. The paradigm may be limiting, but it is far easier to work with: the list of different ways to solve problems is relatively short in comparison to object-oriented patterns.
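To make the “map” and “reduce” constraint concrete, here is a minimal sketch in plain Python rather than Hadoop’s Java API. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical and chosen only for illustration. `run_job` is a toy, single-process stand-in for the framework’s grouping of values by key between the two phases; it provides none of the parallelism or fault tolerance a real cluster would.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: combine all values that share a key into a single result."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key is reduced independently of every other key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_job(["the time is right", "the time is now"], map_fn, reduce_fn)
print(counts)
```

The point of the sketch is the shape of the solution: all of the problem-specific logic lives in the two small functions, and everything else is generic. A design pattern, in this setting, is roughly a reusable recipe for what those two functions should emit and combine.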

So why write the book on MapReduce design patterns now? I have taught a number of courses on Hadoop, mentored several Hadoop newbies, and explained how to do things in MapReduce to more general audiences. Explaining certain approaches over and over again became really tedious, and I found there was a general lack of good, centralized, and authoritative documentation that I could point someone to. At the start, I thought writing such documentation would save me some time, since I assumed I was just one person in one company with this problem. I soon realized that my situation was not unique in the Hadoop user community. Hadoop is sufficiently mature that there are now enough experts and new users for a guide to design patterns to become useful.

Not too early…

You may be asking whether the release of this book is too early in Hadoop’s evolution. After all, prematurely building design patterns can be a waste of time.

First, Hadoop as a project has changed significantly in the past few years, but recent changes are not as radical as they once were. For example, there was a major split between the old and the new MapReduce APIs, with the new API offering a revamped interface but lacking several utilities. For quite a while, deciding which one to use was very awkward. With the release of 1.0, a significant increase in users for mission-critical purposes, mature commercial support from several companies, and more, Hadoop has now reached the point where it has to be stable.

Second, at this point users have had time to determine what works well and what does not. I could have come up with a bunch of patterns that nobody has ever used before, but there would be no point to it. With a more mature community of experts that has repeatedly identified and solved problems, the most common solutions have developed into design patterns.

A few others have written about MapReduce design patterns. My favorite is a blog post by Ilya Katsov, which is closest in spirit to my book. I think his approach to patterns is very similar to mine, reaffirming my belief that this is an important topic. Next is Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, one of the first books written on Hadoop and MapReduce. It covers several design patterns, but for the most part sticks to the text-related domain of problems. I think this book was a great start, but it is not general or domain-independent. Then there are the countless mailing list posts, StackOverflow answers, blog posts, etc. that document little tidbits here and there. Nothing in MapReduce Design Patterns is new or novel. But having useful patterns diluted with marginal ones has made the gems harder to find, necessitating a guide that separates the signal from the noise.

The time is right…

The Hadoop community is ready for an authoritative source of MapReduce design patterns, and I hope my book can either be that source or inspire one. Here are the reasons why this is all coming together:

It’s not too early
Groups of engineers are building patterns independently, but having a hard time sharing them with the rest of the community
There are tons of new Hadoop users every day that could leverage experts’ documentation
MapReduce is a new way of thinking that may not be intuitive to everyone right away, so some ways to solve problems may sneak up on people
MapReduce design patterns provide a foundation for higher-level abstractions such as Pig, Hive, and who knows what else will come next

Hopefully I’ve convinced you that MapReduce design patterns are a good thing. In the long run, though, this really has to be a community effort. Get the word out! Blog about new patterns that you discover, or perhaps talk about them at a Hadoop conference or local Hadoop meetup. This will be even more crucial as Hadoop continues to change. With the nature of data shifting towards even more challenging formats such as audio, imagery, video, and bio, we’ll see some new patterns crop up to tackle the challenges of each. New libraries, tools, and abstractions will be built that make some of the current patterns useless and in turn open the door to completely new patterns. They may also simply enable existing patterns to be implemented more elegantly. Also, with the advent of YARN, and with the rise of other Hadoop ecosystem components, the list of useful patterns for Hadoop will expand beyond MapReduce.

The only way we can keep up is to make the commitment as a community to documenting, discussing, and refining patterns for the greater good, much like the object-oriented programming community has done to great success.
