Using the Netflix Simian Army: An Interview with Gareth Bowles

[interview]
Summary:

Gareth Bowles of Netflix talks about his upcoming presentation at STARWEST 2014, how the cloud is all about fault tolerance, how his time at several Silicon Valley start-ups prepared him for joining Netflix, and how his career turned him from a developer into a tester.



Cameron Philipp-Edmonds: Today we have Gareth Bowles, and he'll be speaking at STARWEST 2014, which is October 12 through October 17, and he is giving a presentation entitled "Release the Monkeys: Testing using the Netflix Simian Army." To start things off, Gareth, can you tell us a little bit about your role at Netflix?

Gareth Bowles: Sure, yeah. Thanks for having me. I work on a team at Netflix called Engineering Tools, where we're responsible for the productivity of our developers. We have about 700 engineers in total, and my team provides build and deployment automation, from source control all the way up to deployment to Amazon Web Services.

We try to make the build and deployment as transparent as possible for our developers so they can get on with what they're good at, which is developing new features for Netflix.

Cameron Philipp-Edmonds: Okay, cool. You talk about cloud, and your talk is that cloud is all about redundancy and fault tolerance. Out of all the things I've heard people say about cloud, I've never really heard anyone say it's all about fault tolerance. Normally people say the cloud is all about magical storage, making storage virtualization dreams come true, or that they really don't even know what the cloud actually is.

Why do you look at the cloud as a test in fault tolerance?

Gareth Bowles: Good question, yeah. Cloud is a very nebulous term, right? It means a lot of different things to different people. What I'm really getting at here is that designing your application to run in a cloud environment is all about fault tolerance because you can't assume that the platform is going to be 100% available. No matter how good your architecture is, you can't guarantee 100% uptime for your client service.

For instance, Amazon Web Services has had various well-publicized outages that took down major websites. All hardware is eventually going to fail, too, even if you're running on your own hardware, so you have to assume any component of your app can fail at any time, and then design your apps to handle the failure as transparently as you can so that your users are not impacted.

Cameron Philipp-Edmonds: Then you mentioned something pretty cool there. You mentioned that no component can guarantee 100% uptime. With that being said, what really is a reasonable uptime to be expected for a development team, and then also, for a consumer? Are those expectations the same?

Gareth Bowles: I don't think they are, no. To answer that question, it depends on what type of app you're developing, whether you have any legal requirements for uptime, for instance, and at the end of the day, what your customers' expectations are for the uptime. I'd say in general, I'd expect a production system to have much higher uptime requirements than a development or test system.

Cameron Philipp-Edmonds: Like you said earlier, you work with the Netflix Simian Army. Can you tell us a little bit about what the philosophy is behind it?

Gareth Bowles: Building on what I was saying about designing for fault tolerance, the Simian Army is designed to make sure that our fault-tolerant architecture actually works by introducing different types of failure but doing it in a controlled way. Rather than having to wake somebody up on a pager at 3 a.m. on a Sunday, we can do it while engineers are standing by to address any problems and run it in a scheduled way that enables us to learn from the problems we find and, in most cases, build automatic recovery mechanisms so that when we do get that failure at 3 a.m. on Sunday, nobody even notices it.

Cameron Philipp-Edmonds: Okay, and you covered the philosophy a little bit. Can you briefly introduce the main members, the main components, of Netflix's Simian Army?

Gareth Bowles: Sure, yes. Chaos Monkey was the one that got it all started. That monkey randomly disables AWS instances, which is one of the most common types of failure that you'll get in the cloud, just an instance goes away. To make sure that we can survive that common failure without any kind of customer impact, we used to run Chaos Monkey as a controlled experiment with engineers standing by to fix problems, but we're now comfortable enough that we make it the default for production and teams have to actually explicitly opt out of Chaos Monkey if they don't want it to go into instances in production.
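To make the opt-out model concrete, here is a minimal sketch of the selection step only. It is not Netflix's implementation and makes no AWS calls: `ServiceGroup`, `pick_victims`, and the `probability` parameter are hypothetical names chosen for illustration. Every group is eligible by default, and a team has to set `opted_out` explicitly to be skipped.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceGroup:
    """Hypothetical stand-in for a group of instances running one service."""
    name: str
    instances: List[str]
    opted_out: bool = False  # eligible by default; teams must opt out


def pick_victims(groups: List[ServiceGroup],
                 rng: random.Random,
                 probability: float = 1.0) -> Dict[str, str]:
    """Choose at most one random instance per eligible group to disable."""
    victims = {}
    for group in groups:
        # Skip teams that opted out and groups with nothing to disable.
        if group.opted_out or not group.instances:
            continue
        # Each eligible group is hit with the given probability.
        if rng.random() < probability:
            victims[group.name] = rng.choice(group.instances)
    return victims


if __name__ == "__main__":
    groups = [
        ServiceGroup("api", ["i-1", "i-2", "i-3"]),
        ServiceGroup("billing", ["i-4"], opted_out=True),
    ]
    print(pick_victims(groups, random.Random(42)))
```

In a real setup the chosen instance would then be terminated through the cloud provider's API; the sketch stops at selection so that it stays runnable anywhere.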

Cameron Philipp-Edmonds: Okay. Why is it called Chaos Monkey?

Gareth Bowles: Really, because it's introducing chaos. Theoretically it's introducing chaos, but if you design your apps right and you get your fault tolerance correct, then there won't be any chaos. I guess the idea is that we're testing for the absence of chaos.

We have Latency Monkey. That one introduces artificial delays in communications between services. Netflix runs on a distributed service architecture where we've got hundreds of little microservices all talking to each other to make up the Netflix streaming experience.

We can even make very large delays with Latency Monkey and simulate a complete service outage without actually bringing the instances down, so it's kind of an easy way to test a service outage.
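The idea of injecting delay to imitate an outage can be sketched in a few lines. This is an illustration under stated assumptions, not how Latency Monkey itself works: `with_latency`, `call_with_timeout`, and the `recommendations` service are hypothetical. A caller that enforces a timeout and has a fallback sees a sufficiently delayed service exactly as if it were down, even though the instance is still running.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Wrap a service call so every invocation is delayed by delay_s seconds."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)  # the injected, artificial delay
        return func(*args, **kwargs)
    return delayed


def call_with_timeout(func, timeout_s, fallback):
    """Call func in a worker thread; return fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # From the caller's side, a very slow dependency looks like an outage.
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def recommendations():
    """Hypothetical downstream service."""
    return ["personalized"]


if __name__ == "__main__":
    slow = with_latency(recommendations, delay_s=0.3)
    print(call_with_timeout(slow, timeout_s=0.05, fallback=["popular"]))
```

The design point is that the test exercises the caller's timeout and fallback path, which is the same path a real outage would take.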

And we have Conformity Monkey. That one finds instances that don't adhere to best practices. For instance, if they're not a member of an auto scaling group in AWS, then it will shut them down in order to give the service the chance to relaunch them properly. We have Janitor Monkey. That one eliminates clutter and waste by removing unused resources, like unattached EBS volumes and AMIs unassociated with running instances. That keeps our costs down and makes our environment easier to navigate by eliminating pointless resources.
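The three checks mentioned here (instances outside an auto scaling group, unattached volumes, and images not used by any running instance) can be sketched over a plain in-memory inventory. The `Instance` and `Volume` records and the function names are hypothetical; a real tool would build the inventory from the cloud provider's API and apply more rules than these.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Instance:
    instance_id: str
    image_id: str
    running: bool
    in_scaling_group: bool


@dataclass(frozen=True)
class Volume:
    volume_id: str
    attached_to: Optional[str]  # instance id, or None if unattached


def nonconforming_instances(instances: Iterable[Instance]) -> List[str]:
    """Conformity-style check: running instances outside a scaling group."""
    return [i.instance_id for i in instances
            if i.running and not i.in_scaling_group]


def unattached_volumes(volumes: Iterable[Volume]) -> List[str]:
    """Janitor-style check: volumes not attached to any instance."""
    return [v.volume_id for v in volumes if v.attached_to is None]


def unused_images(image_ids: Iterable[str],
                  instances: Iterable[Instance]) -> List[str]:
    """Janitor-style check: images no running instance was launched from."""
    in_use = {i.image_id for i in instances if i.running}
    return [image for image in image_ids if image not in in_use]


if __name__ == "__main__":
    instances = [
        Instance("i-1", "ami-a", running=True, in_scaling_group=True),
        Instance("i-2", "ami-b", running=True, in_scaling_group=False),
    ]
    volumes = [Volume("vol-1", "i-1"), Volume("vol-2", None)]
    print(nonconforming_instances(instances))
    print(unattached_volumes(volumes))
    print(unused_images(["ami-a", "ami-b", "ami-c"], instances))
```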

Then we have some extensions of Chaos Monkey that we brought in fairly recently called Chaos Gorilla and Chaos Kong, so those are going up in scale. Chaos Gorilla simulates the outage of an entire availability zone in AWS, which has actually happened a couple of times. Chaos Kong goes one step further and simulates an entire region going out, a region being made up of multiple availability zones. We use those on a scheduled basis; we're not running those all the time in production. We test our ability to either automatically rebalance between the availability zones that are still there or completely fail over to a different region, in the case of Chaos Kong.
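A zone outage and the rebalancing it should trigger can be modeled as simple arithmetic on instance counts. The sketch below is only a toy model under assumed names (`fail_zone`, zone labels like `zone-a`): it removes one zone and spreads its capacity as evenly as possible over the surviving zones, and it raises an error when no zone survives, which is the case where a region failover would be needed instead.

```python
from typing import Dict


def fail_zone(capacity: Dict[str, int], failed_zone: str) -> Dict[str, int]:
    """Return per-zone instance counts after failed_zone goes away.

    The lost instances are redistributed as evenly as possible across the
    surviving zones, so total capacity is preserved.
    """
    if failed_zone not in capacity:
        raise KeyError(f"unknown zone: {failed_zone}")
    survivors = sorted(zone for zone in capacity if zone != failed_zone)
    if not survivors:
        # Nothing left in this region: this is the region-outage case.
        raise RuntimeError("no surviving zones; fail over to another region")
    share, extra = divmod(capacity[failed_zone], len(survivors))
    return {
        zone: capacity[zone] + share + (1 if index < extra else 0)
        for index, zone in enumerate(survivors)
    }


if __name__ == "__main__":
    before = {"zone-a": 4, "zone-b": 4, "zone-c": 4}
    print(fail_zone(before, "zone-a"))
```

A check worth running against any such model is that the total instance count before and after the simulated outage is the same.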

Cameron Philipp-Edmonds: You mentioned earlier that Chaos Monkey is kind of what started the whole Simian Army. Is that the one that's really used the most?

Gareth Bowles: Yeah, I would say probably Chaos Monkey and the housekeeping ones, the Conformity Monkey and the Janitor Monkey, are the ones that we enable by default for all instances we're running in production. So people have to opt out of those, and then the other ones we'll use more on a scheduled basis for doing specific tests.

Cameron Philipp-Edmonds: Okay, and now it's called Netflix's Simian Army, and simian means referring to monkeys or apes. Are there any in the army or in the arsenal that aren't really named after monkeys or apes?

Gareth Bowles: I don't think so, no. I don't know the answer to that off the top of my head but I'm pretty sure not. We have things called a monkey, or a gorilla, or a kong if it's on a bigger scale.

Cameron Philipp-Edmonds: Is there anything other than gorilla or kong or monkey?

Gareth Bowles: No, I don't think so. Monkey is the standard unit of chaos.

Cameron Philipp-Edmonds: So a monkey is the standard unit of chaos?

Gareth Bowles: Yeah.

Cameron Philipp-Edmonds: All right, fantastic. Now, you've worked for several different Silicon Valley companies. How has working for those companies prepared you for working at a juggernaut like Netflix?

Gareth Bowles: Yeah, Netflix is definitely the biggest company I've worked for. I started out in the Valley working at Borland, which back in the day was a pretty good company. People who have been around for a while probably remember Borland for the developer tools that they made: Turbo Pascal, JBuilder, Delphi, and various other tools. Borland was actually similar to Netflix in many ways. It was very engineering driven, the engineers had a lot of freedom to set the product direction, and there was not too much process for implementing the products that they wanted to implement.

Netflix is very similar to that, too. It's like a big startup, in terms of having a lot of smart people who are focused on a clear product vision. It moves very fast. It can get things into production really quickly, but, as you mentioned, we're also a pretty large, successful company, so we have a lot of resources to get things done, which is a refreshing change from working at a startup where you have to watch all your costs.

Cameron Philipp-Edmonds: Okay, you've dealt with other companies and you built upon that creativity and you got that mindset. Now that you've come to Netflix, you have the resources to make that creativity happen?

Gareth Bowles: Exactly, yeah. It's a pretty nice situation to be in.

Cameron Philipp-Edmonds: All right, perfect. Now, you started as a developer and then you moved more into the realm of testing. What led you to do that switch?

Gareth Bowles: Right, a lot of people ask me that. I think it's a very common career path for testers these days. In my case, I found development really challenging and interesting, but I found I was only focusing on a small part of the products that I was working on, and I really wanted to see the bigger picture of what's involved in getting a product from source code all the way out into the hands of customers. What do the customers do with it? What are we really trying to achieve here?

I found that testing gave me a much better way to understand the product from a customer's point of view, and also get a more high level view of the product and everything that it did, rather than concentrating on a small aspect of a product.

Cameron Philipp-Edmonds: There are a lot of people out there who have the misconception that testing is really just checking, but there are a lot of people who will say firsthand that testing is really exploring. It's really creating value, and finding that value. Would you agree with that?

Gareth Bowles: Definitely, yeah. I think there's a place for checking, and most of that place, I think, is in automation for doing regression testing, just to make sure that you didn't break something when you changed a product. But I think the real value of a tester, someone who wants to be a career tester these days, is exploratory testing: really thinking about situations that are hard to automate, or finding things that are hard to think about without taking a higher-level, customer-focused view of a product.

Cameron Philipp-Edmonds: What is one thing you would like the attendees of your presentation, which is "Release the Monkeys: Testing using the Netflix Simian Army," to take away?

Gareth Bowles: I think one short thing is don't be afraid to test in production.

Cameron Philipp-Edmonds: Is there anything else you'd like to say to the delegates of STARWEST before they attend the conference and, of course, before they attend your presentation?

Gareth Bowles: No, I don't think so. I'm really looking forward to the conference; it will be my first STAR conference and it looks like there are some great speakers lined up. I'm looking forward to attending talks as well as giving mine, and I'm just looking forward to meeting as many people as possible. (Gareth also added that Netflix is always hiring and to find him at the conference or on Twitter @garethbowles if interested in working for Netflix.)

Cameron Philipp-Edmonds: All right. Awesome. That concludes our interview for today. Thank you so much for taking the time to speak with us, Gareth, and for those of you that don't know, Gareth Bowles will be speaking at STARWEST 2014, October 12 to October 17, and his presentation is titled "Release the Monkeys: Testing using the Netflix Simian Army." Thank you so much, Gareth.

Gareth Bowles: Thanks a lot.

 

Gareth Bowles started out as a developer and later graduated to breaking other people's software instead of his own before realizing that his real passion is for shipping product faster, cheaper, and more reliably, while still getting a good night's sleep. Gareth has practiced and managed quality engineering and technical operations at Silicon Valley companies, from six-person startups to major industry players. He is currently a member of the Engineering Tools team at Netflix, where he builds cloud automation and continuous integration tools that enable any developer to build, test, and deploy the services that make up the Netflix movie and TV streaming operation. Follow Gareth on Twitter at @garethbowles.

Podcast Music: "Han Solo" (Captain Stu) /CC BY-NC-SA 3.0
