Has DevOps Gone Off the Rails?

Summary:

DevOps is evolving with some potentially very harmful choices embedded in it. Among these are poor adoption of sound computer science, little thought to the maintainability of DevOps code, and choices of tools based solely on productivity without concern for maintainability. Will this cause DevOps to fail to live up to its potential?

DevOps is a rapidly growing philosophy within the IT community, espousing close collaboration between software developers and system operations. It encourages everyone involved in the design and construction of software to think of the production environment as their target. DevOps is how companies like Netflix and Amazon are able to maintain and enhance their software with few hiccups while serving tens of millions of people at a time. DevOps has also been touted as the new agile, at least as practiced by those who “do it right.”

The refrain “They did not do it right” is often heard when Scrum has been implemented in an organization but the results are not what was expected. We now understand the concept of “being agile” versus “doing agile”: To “do agile” is to mindlessly follow the practices defined by an agile methodology such as Scrum, with poor results. In contrast, to “be agile” is to truly embrace agile's intentions and apply them in a way that leads to actual increased agility, quality, and value.

Is DevOps headed for a similar struggle to “do it right”? I think the direction it’s moving might cause it to not deliver on its potential. For example, among many DevOps engineers I hear constant complaints of unreliability and a continuous stream of problems with different causes. It is also clear that DevOps tools have completely ignored the hard-won lessons of computer science. DevOps users often build homegrown frameworks of great complexity, ignoring the agile value of simplicity—a value that is also paramount to reliability. Those who choose the tools seem to make their choices based on personal productivity rather than lifecycle maintainability. Finally, DevOps teams have recreated the batch job, with all of its shortcomings.

Let’s examine these challenges in more detail.

A Pattern of Unreliability

Between 2001 and 2006 I specialized in consulting for clients who had been experiencing chronic problems with IT systems but had been unable to find a common cause to fix. The root problems had to do with systems that had been created in a hurry by very smart individuals who were key to the systems' maintenance but had since been assigned to other efforts, leaving the systems to grow unreliable. Over time, the frequency of anomalies would increase.

These systems were not poorly designed, but they were complex (in one case, there was a Perl loop nested twelve levels deep!) and wholly without comments of intent within the code.

I see the same thing developing today in IT groups that are trying to practice DevOps: lots of homegrown frameworks of great complexity, with little documentation and lots of tribal knowledge needed to maintain them. This is a prescription for great headaches down the road as the gurus who created these frameworks move on.

Lessons of Computer Science Ignored

Most of today's DevOps tools rely on Ruby. Ruby is a language designed and maintained by an individual, Yukihiro Matsumoto, known as Matz. Recently I had to build a multithreaded application and was shocked to discover that Ruby threading is broken. Ruby has threads, but they do not truly run in parallel because (as of this writing) the standard Ruby interpreter maintains a “global interpreter lock” that serializes the execution of the threads. The threads operate like an old time-sharing system, running one at a time. Sun got Java threads working in short order. Ruby is now two decades old, and threads still don't work? To be fair, Java had the support of Sun, but that is kind of the point: Java was nurtured into a robust enterprise language, with sufficient resources to achieve that.

User Comments

6 comments
Michael Munsey

I would generally recommend not using multithreading in a cloud environment.  Doing so is counter to the goal of making your app horizontally scalable.  If you are doing multithreading, you scale up by adding CPUs, not by elastically adding VMs.

August 27, 2014 - 6:47pm
Clifford Berg

Good point, but I only used the threading example to illustrate that Ruby has some major features that do not work as expected, and that should make the reliability of the platform suspect. For infrastructure, things need to be rock solid.
Also, contrary to popular belief, multi-threading _is_ crucial for horizontal scaling. Popular frameworks such as Node use threads in the background: it is only the entry point loop that is single threaded. This is a very old design pattern, used originally for device drivers.

 

September 29, 2014 - 7:34am
Damon Edwards

I was totally following with your first paragraph describing DevOps (closer collaboration, design for production, etc.). But after that you stopped talking about DevOps and just described people making bad and sloppy decisions.

None of those bad decisions you described had anything to do with DevOps.

In fact, I would say that the intense focus on reducing lead time and improving quality that are the hallmarks of the DevOps movement would catch and call out all of those bad practices immediately (crap code, unreliability, overly complex tools, etc.). Much like many companies use Agile in name only, I think you've been seeing companies who are using DevOps in name only to justify bad behavior.

I would suggest you check out conferences like the DevOps Days global series (http://devopsdays.org), DevOps Enterprise (http://devopsenterprise.io), Velocity (http://velocityconf.com), or FlowCon (http://flowcon.org) to see a large number of companies leading the DevOps charge and working in almost the exact opposite way that you described. 

September 4, 2014 - 12:06am
Matthew Skelton

The focus on Ruby's thread handling as a problem for DevOps is strange. The bottlenecks I see in most organisations attempting to increase speed of delivery and improve cycle time are typically at a much higher level than an OS thread: the constraints are usually at the level of cross-programme prioritisation at the organisational level (often with a highly over-worked 'Ops' team).
There are situations in 2014 where worrying about threading is valid (high-speed, low-latency stuff like financial markets and betting, for instance), but not where infrastructure automation and deployment are concerned: the challenges are at a higher level of abstraction.
As Damon says, there are loads of good examples of organisations taking an effective approach with DevOps. I'd like to add another suggestion: Build Quality In, a book of Continuous Delivery and DevOps experience reports: http://buildqualityin.com/ - success stories from around the world.

September 29, 2014 - 6:39am
Clifford Berg

Great comments.
Just to clarify: the example of threading was only intended to illustrate the immaturity of Ruby after 20 years. One expects a language to be robust and that all the features will work. The waning unreliability of tools as system components is a contributor to overall unreliability of systems. Imagine that you have a system of 100 things, and each thing is 99% reliable: how reliable will the _system_ be?
My concern is that things are getting really complicated fast, with infrastructure coding, and not enough attention is being put on maintainability and reliability. The use of dynamic languages is only one manifestation of this. It is not always feasible to have "unit testing" for DevOps functions, so static languages and tools amenable to static integrity (syntactic and semantic) checking would really help, but things are going in the other direction. The companies at the forefront of DevOps - the Amazons and what-not - have huge amounts of money to throw at it: if they need ten more folks to build and maintain custom tools, they just hire them, and they hire the best. But other organizations cannot use this strategy, and maintainability becomes a big problem, as do key-person dependencies.
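
To put a rough number on the "100 things at 99%" question, here is a minimal sketch in Python. It rests on two assumptions that the comment does not state: the components fail independently, and the system works only if every component works (a series system). The function name `system_reliability` is purely illustrative.

```python
def system_reliability(component_reliability: float, component_count: int) -> float:
    """Reliability of a series system built from independent components.

    Assumptions (not stated in the comment above): the system works only if
    every component works, and components fail independently of one another.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if component_count < 0:
        raise ValueError("component_count must be non-negative")
    # Independent events: the probabilities of success multiply.
    return component_reliability ** component_count


if __name__ == "__main__":
    # 100 components, each 99% reliable.
    print(f"{system_reliability(0.99, 100):.3f}")
```

Under those assumptions the whole system works only a little more than a third of the time. Redundancy or correlated failures would change the figure, so treat it as an illustration of the trend rather than a measurement.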

 

September 29, 2014 - 7:22am
Clifford Berg

I meant to say "waning reliability". (No edit function here!)

 

September 29, 2014 - 7:23am

About the author

CMCrossroads is a TechWell community.
