In IT we have always worked hard to ensure that systems remain available, stable and “up”. But the manner in which we achieve this outcome has changed. Gone are the days of trying to engineer failure “out of the system”, as that proved to be beyond human capability. Rather, by turning the problem on its head and accepting that failures happen, we can tolerate them better.
This means better systems, higher availability—and more sleep!
In this episode we look into a new open-source project called Chaos Lemur that can help make your systems more resilient by testing them constantly.
- Subscribe to the feed
- Feedback: email@example.com
- Links Referred to in the Show:
Welcome to the Pivotal Perspectives Podcast, the podcast at the intersection of Agile, Cloud and Big Data. Stay tuned for regular updates, technical deep dives, architecture discussions and interviews. Now let’s join Pivotal’s Australia & New Zealand CTO Simon Elisha for the Pivotal Perspectives Podcast.
Hello everyone and welcome back to the podcast. Great to have you back. I’ve put together a little episode that I think you’ll like. I hope so; you will be the judge ultimately, if you keep listening or not. Today I want to talk to you about a new open-source project called Chaos Lemur, an addition to the growing Chaos family in the IT world that you may have heard about. But let’s take a step back and talk about the domain we’re dealing with, and why we are talking about a Lemur.
In IT, we have long faced the challenge of availability, system availability. As users and consumers of services, we want them to just always be there, no matter what happens. Anyone who’s worked closely with infrastructure, be it networking, be it storage, be it servers, etc., knows that bad things happen to good people, and bad things happen to good data centers, and bad things happen to bad data centers too. So it can be very challenging to keep systems up and running in the face of hardware failure, configuration failure, outages and the like.
We’ve gone through a huge evolution over the years in terms of the way we think about system availability and the way we put in place checks and balances to make sure systems remain up and operational. The more conventional way, and I lived through this myself personally in the 90s and 2000s, was what I like to call the “golden screwdriver” approach to high availability: layering additional components onto the system, typically some form of clustering, which would itself require its own network infrastructure and its own heartbeating, plus a lot of integration work in the application to make sure it was well monitored and understood so you could restart components, and there were often the horrors of shared storage to deal with, etc. This tended to be very, very complicated, often very expensive, and used a lot of project staff. While it worked, and worked well, it was complicated, expensive and difficult.
So typically it was applied to large systems, and in general to the database tier of large systems. I personally worked on many large trading systems back in the day that had Oracle behind them, and we applied a lot of clustering technology to make sure they worked, and that if you pulled any component out it would be okay, and that sort of stuff. Even so, people still weren’t feeling great about the situation.
What’s changed over the years, and this has really come out of the Internet boom, the change in architectural thinking and the embrace of genuinely distributed architecture, is a true and real understanding that in anything of any meaningful scale, something is going to fail. We live in complex environments, we live in challenging environments, and we have moved to more commodity hardware, which is built to a different price point and a different quality level. Although it’s good, it’s not designed to run all the time without anything ever going wrong.
It’s designed to run most of the time without anything ever going wrong. So when we recognize that things will fail, we can do one of two things. We can take the original approach I mentioned of “I’m just going to protect against everything failing; I’m just going to make sure it can’t fail.” Any time you hear a human being saying “this can’t happen”, you can pretty much guarantee it will happen. Human beings are famous for that sort of hubris; history gives us the “unsinkable ship”, the “earthquake-proof building”, the “bridge that can’t fall down” and the “uncrushable car”. So the approach that I’ve adopted personally, and recommend to others, is to accept that things will fail and things will break. Things are not perfect.
Once you cross that Rubicon and accept this as an architectural value and design principle, you’re freed from that worry and that strain, because you suddenly embrace the concept of component failure and you start to design for it rather than designing against it. That’s a really nuanced concept, so just take a moment and think about it: designing for failure rather than designing against failure. You’re not fighting the failure that’s going to happen, trying to stop it, saying “I’m going to control you because I’m the developer, I’m the architect, I’m the programmer, I can change the world.” No you can’t; stuff is going to happen.
By designing for failure and understanding that it will happen, you’ll cope with it far more gracefully. If you understand that a particular node may disappear suddenly, you can implement your application such that it tolerates that gracefully. Maybe it will not error out when it hits a service that doesn’t exist. Maybe it will have some sort of intelligent back-off and retry mode. Maybe it will switch services: say, if I can’t get an answer from service A, I can get a different answer from service B that may keep my customer happy.
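As a rough illustration of those ideas, here is a minimal sketch of back-off-and-retry with a fallback. The `primary` and `fallback` callables, the exception type and the timings are purely illustrative names of my own, not from any particular library:

```python
import time

class ServiceUnavailable(Exception):
    """Raised when a downstream service cannot answer (illustrative)."""

def call_with_backoff(primary, fallback, max_attempts=3, base_delay=0.1):
    """Try the primary service with exponential back-off; degrade to the
    fallback instead of erroring out if the primary stays down."""
    for attempt in range(max_attempts):
        try:
            return primary()
        except ServiceUnavailable:
            time.sleep(base_delay * (2 ** attempt))  # back off: 0.1s, 0.2s, 0.4s...
    # Service A never answered: get a different answer from service B.
    return fallback()
```

In practice the fallback might return a cached or default answer, which keeps the customer happy while the primary service recovers.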
Essentially it’s about presenting a smooth and consistent experience to the end user or customer, whilst on the back end all this cool stuff is happening. It’s the classic swan gliding gracefully across the lake while the feet underneath are going like mad.
So one of the things to do, if you embrace and really get into this methodology and this concept of design, is to say: if things are going to fail, I’m going to be aggressive in the way I test for this failure. I’m not just going to say that things might fail from time to time and it’s a really bad day when that happens. I’m going to say that things will fail all the time, and not only that, I’m going to get ahead of the game and induce failure into my system myself.
Again, this is a mindset shift. I used to do a lot of work in disaster recovery and business continuity type environments. I’d talk to customers about what they’d set up, and they’d be very proud of it: active-passive replication and so on. And I’d say, “Oh, what if I turn this system off now?” They’d go, “No, no, don’t do that. We need change control.” Well, then you haven’t got a resilient and robust architecture. So the good folks at Netflix and many others have pioneered the concept of having tooling in your environment that does this kind of thing capriciously and will turn things off. I think it’s fantastic.
So, the most famous one is Chaos Monkey, which turns off virtual machines, alongside the various other members of the Simian Army out there. What our local Cloud Foundry team recognized was that they needed something that was not so tied to one particular provider; in Chaos Monkey’s case it was very closely tied to some of the technologies in Amazon Web Services. They wanted something more cross-platform that would work on VMware, that could work on OpenStack, and that could work on a number of different providers.
So they created a new project called Chaos Lemur, L-E-M-U-R. Now, I’m probably saying that in my very Australian accent, so those of you overseas probably say it in very different ways. If you’ve got children and you’ve seen Madagascar, you know what a lemur is all about. They’ve released it as an open-source project that you can use, build upon, contribute to, etc., which is really exciting. Chaos Lemur focuses on addressing a particular user base: systems that are deployed using BOSH. I’ve talked about BOSH in the past, and again, BOSH doesn’t stand for anything meaningful; it’s a recursive acronym for “BOSH outer shell”. But BOSH does something very cool, which is to help deploy distributed systems very, very intelligently across very large infrastructure. Of course, BOSH underlies Pivotal Cloud Foundry and allows us to do cool things like the deployments we do, the rolling upgrades, canary deploys and service deployment.
What Chaos Lemur does is query BOSH to understand all the virtual machines that are out there. It will then schedule virtual machines for destruction or termination, allowing you to see what happens in your system when that takes place: what the recovery looks like, what the outage looks like, which services fail that you didn’t think would fail, etc. It’s quite a sophisticated tool, so let me walk you through it a little.
So the first thing it will do is query the BOSH system to understand what’s out there, and then, in a semi-random way, figure out what it wants to kill. When it does that, you can do a dry run and just see what would have happened, what it would have killed, just to get yourself sweating a little bit. You can also exclude certain systems from the run if you want to, though I wouldn’t recommend that; I think you should test everything equally. It’s cheating, but some people like to do that: to protect certain things from going down.
When you run it for real, it will actually destroy those virtual machines, so you can see what’s going to happen in that event. The other thing it does is allow you to set different termination probabilities. You can choose, for example, a probability of 0.1, which means that a virtual machine would be destroyed roughly one in every ten times it’s considered for deletion. And one of the things that Chaos Lemur does really effectively is that it doesn’t treat this termination as a one-shot exercise that you run manually yourself, although you can. It actually schedules it to happen once every hour. All the time. You should be running this sort of stuff in production.
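To make that per-cycle decision concrete, here is a small sketch of probability-based victim selection with a dry-run mode and an exclusion list. This is the idea in miniature, with function and parameter names of my own invention, not Chaos Lemur’s actual internals:

```python
import random

def plan_terminations(vms, probability=0.1, excluded=(), dry_run=True, rng=random):
    """Decide which VMs die this cycle: a sketch, not Chaos Lemur's code.

    Each VM is considered independently; with probability=0.1, a VM is
    destroyed roughly one in every ten times it is considered.
    """
    victims = []
    for vm in vms:
        if vm in excluded:
            continue  # spared from the chaos (not recommended: test everything equally)
        if rng.random() < probability:
            victims.append(vm)
    if dry_run:
        print("Would terminate:", victims)  # sweat a little, destroy nothing
    return victims
```

Scheduling something like this once every hour against the VM list queried from BOSH gives you the constant, production-time failure injection described above.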
Now, if you say “oh, we can’t do that”, again I would challenge your thinking. Think about how resilient you want to show the product to be. It is compelling when you can show senior executives that you are terminating components in your environment all the time and your environment stays up and running. You’ve turned what were severity-1, system-down, run-for-the-hills type events into severity-4, warning-only or notification-only events: something took place and it was automatically recovered.
This is really, really powerful. Now of course, above and beyond that default of once per hour, you can trigger it whenever you want. It has a REST API, so you can invoke it very simply. It’s a simple but elegant solution, because it isn’t any more complicated than it needs to be. That’s always the trick. It does have things like reporting functionality and other really cool components as well. Of course it’s available on GitHub; I’ll put the link in the show notes. You can fork it and extend it in any way you want, submit pull requests, and contribute to the project, because I think it’s very worthwhile and a very handy thing.
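For example, triggering a run over HTTP could be as simple as one POST. The host and the `/chaos` path below are placeholders of mine for illustration, not verified Chaos Lemur routes, so check the project’s README for the real API:

```python
from urllib import request

def build_chaos_trigger(base_url="http://chaos-lemur.example.com"):
    """Construct (but don't send) a POST that would kick off a chaos run.

    The "/chaos" path is an assumed placeholder, not a verified route.
    """
    return request.Request(
        url=base_url + "/chaos",
        data=b"{}",  # empty JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually fire it: request.urlopen(build_chaos_trigger())
```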
So again, think about how you test your system, and if you are using any system based on BOSH, or using BOSH itself, you can take advantage of Chaos Lemur to do that work for you. I would suggest that it probably wouldn’t be too hard to turn it to a few other things as well, if you want to, in your own environment. But remember, fighting against system failure and trying to prevent it from happening in the first place is kind of a fool’s errand. You need to understand that it will happen; it’s an inevitable fact of day-to-day life. Once you make it so trivial and so easy to tolerate, it’ll never bother you again. I guarantee you will get a lot more sleep, and that’s always a good thing in this world of ours.
So I hope that was useful: a really interesting project, something worth looking at, and some thought-provoking ideas as well. Again, we would love to get your feedback at firstname.lastname@example.org. I look forward to speaking with you again soon. Thanks very much; till then, keep on building!
Thanks for listening to the Pivotal Perspectives Podcast with Simon Elisha. We trust you’ve enjoyed it and ask that you share it with other people who may also be interested. We’d love to hear your feedback, so please send any comments and suggestions to email@example.com. We look forward to having you join us next time on the Pivotal Perspectives Podcast.
About the Author
Simon Elisha is CTO & Senior Manager of Field Engineering for Australia & New Zealand at Pivotal. With over 24 years’ industry experience in everything from mainframes to the latest cloud architectures, Simon brings a refreshing and insightful view of the business value of IT. Passionate about technology, he is a pragmatist who looks for the best solution to the task at hand. He has held roles at EDS, PricewaterhouseCoopers, VERITAS Software, Hitachi Data Systems, Cisco Systems and Amazon Web Services.