Facebook and the limits of DIY distributed systems

March 27, 2019 Derrick Harris

This post originally appeared as part of the March 21 Intersect newsletter.

“Yesterday, as a result of a server configuration change, many people had trouble accessing our apps and services.”

Facebook’s explanation of its 14-hour outage last week sounds simple enough, but it very possibly belies an intricate series of failures across a hugely complex infrastructure that spans data centers around the world. Fourteen hours is an awfully long time for a company whose systems are more or less designed to maximize uptime, and that employs some of the smartest software engineers on the planet.

But Facebook is hardly alone in suffering lengthy outages caused by seemingly inconsequential things. Just about every large website, web company and cloud provider has been through the same thing, including AWS, Google, Microsoft and Apple. At their scale and with the complexity of their architectures—physical and software—all the automation and engineers in the world sometimes aren’t enough. One thing goes wrong, and it cascades.

This is one of the reasons why some people have a difficult time understanding, or at least accepting, the rush toward microservices architectures and all things Kubernetes. As the saying goes, “Shit happens.” When it does, it’s probably easier to debug a relatively simple monolith than to track down the cause across a collection of interconnected microservices running on ever-changing infrastructure.

That being said, when a company’s software footprint, user count and ambitions reach a certain scale, as is almost certainly the case for any large enterprise, microservices (done right) are very likely the right option for bringing order and agility to its IT organization. Depending on its application portfolio, Kubernetes might be, too. Companies like Facebook and Google don’t operate globally distributed systems and build the tools they build because they want to; they do it because they have to.

Of course, there are also business benefits to these types of architectures when they’re done well. Google’s just-announced streaming gaming service is perhaps an extreme example, but the software engineering culture and technologies the company has put in place do help it jump into new digital markets when it sees an opportunity.

However, the trick for most mainstream enterprises is taking advantage of the architectural lessons large web companies have taught the world (and the software they’ve developed) without taking on their do-it-yourself and/or not-invented-here attitudes. Finding the budget, the people and, frankly, the institutional DNA to tackle every part of enterprise IT is hard work (thus the upcoming PagerDuty IPO). For example, standing up a Kubernetes cluster might be easy enough; operating it and all the complementary components at any reasonable scale, security level, etc., can prove to be a different story.

That’s why there’s a raging debate over open source licensing happening right now. The gist of the argument is who has the right to serve enterprise customers with commercial versions of popular projects.

The great message of Amazon CTO Werner Vogels in the early days of cloud computing was that companies shouldn’t invest in “undifferentiated heavy lifting,” by which he meant managing data centers and provisioning servers. The message seems to have resonated (if the success of AWS and its peers is any indicator), only now that heavy lifting has shifted to operating complex data center software and application architectures. Technologies like Kubernetes (or Hadoop or OpenStack before that) might not cost anything to install, but that’s where the free lunch ends.

Perhaps the rash of recent outages at webscale services, including Facebook, will be a useful reminder for enterprises to not fall into that old trap.

About the Author

Derrick Harris is a product marketing manager at VMware.