DevOps is the merging of many IT disciplines, including software development, operations, engineering, and site reliability engineering, into an agile, cohesive unit that produces better products and seamless system integration. However, it often falls short of that vision.
It’s not surprising, once you realize these teams come with entirely different skill sets. All are territorial about their responsibilities and obligations, creating the inescapable and ever-present environment known as a silo mentality. Yet one of the primary functions of DevOps is to eliminate silos, with ongoing input from all perspectives throughout product development. Nice idea, but is it realistic? It has to be if the company or organization expects to remain agile, resilient, and competitive in the marketplace.
Unfortunately, the DevOps train of progress that should be running at high speed tends to derail the first time the software or system running it fails. Angry customers assail the product provider while developers and operations retreat to their respective silos to figure out what went wrong. A cohesive and effective DevOps team should be able to anticipate and avert such issues, yet disruptions still manage to occur. Now what?
Typical but Avoidable Events
One possible answer is automation of the delivery process, but the very mention of the word automation makes organizations, especially large ones, apprehensive because of cost and complexity issues. CFOs focus on the former, while team members scrutinize the latter, in particular delivery’s suitability for automation.
When an automated delivery and testing process functions as expected, it detects possible software malfunctions before they occur in production. However, CFOs may prefer to rely on human effort to set up test and production environments rather than automation, based on what they consider insufficient ROI. Manual efforts can certainly discover and prevent system-threatening scenarios, but when they don’t, situations like the following occur.
While working as a software developer in the southeastern United States, I witnessed a borderline chaotic situation when an application on a system lacking versioned infrastructure resources and appropriate controls malfunctioned. Financial services software that had been recently updated and appeared to be running smoothly suddenly failed.
Developers and operations pointed fingers at each other while trying to establish the root cause. The immediate response was what one would expect: Roll back the software to the earlier version based on an assumption that the most recent update caused the stoppage.
Wrong! The product and system remained inoperable, much to the chagrin of the provider and, especially, its customers, who demanded a timetable for resumption of service. Understandably, they became even more distressed and irate when we couldn’t provide one.
You can roll back software, but doing the same with system infrastructure that was not installed through automation is practically impossible. Thus the responsibility fell on operations to manually retrace the system setup sequence step by step. That led to an even bigger problem, one that is unfortunately typical: the setup sequence had never been documented, so team members had to rely on their individual memories to restore the system to its previous state.
Eventually, operations discovered the cause: a security patch that disrupted and halted the system. It took nearly an entire business day to restore operations, which was an inexcusable loss, made more so by the absence of documentation.
Without automation, inadequate or nonexistent documentation unnecessarily lengthens the time it takes to reinstall all required software in the correct sequence, should the operations team have to rebuild the system. Complexity in the product, its applications, or both compounds the difficulty of restoring an online server when there is no code to ensure accuracy.
The absence of versioned infrastructure as code (IaC) and automated provisioning undermines one of the most important benefits of DevOps: the ability to version, manage, and control the infrastructure (servers and networking) required to run software applications in development, testing, and production. Without this automation, troubleshooting health-rule violations (e.g., a CPU operating beyond a fixed rate), anomaly detection (e.g., inconsistent CPU operations flagged by machine-learning predictive models run on historical data), or simply a system that isn’t operating properly is time-consuming and costly.
Automating infrastructure setup and continuous monitoring helps keep system environments stable and less susceptible to outages.
But understanding IaC is not solely the province of operations. All members of the DevOps team have a stake in it, because it’s important for the entire delivery process.
IaC in the Delivery Process
To fulfill its promise, DevOps must accomplish at least two goals: deliver optimal products to business stakeholders through systems that can be handled by a cross-functional team, and achieve scalable and elastic product architecture without affecting business and customers. The second goal becomes close to insurmountable unless teams can accelerate software delivery while reducing the possibility of product or system failures.
Two open source tools, Docker and Terraform, are capable of reducing infrastructure and environment issues while helping to accelerate the delivery process. Docker allows DevOps teams to build and manage virtual environments called containers, so applications can be deployed to many platforms, regardless of the underlying resources. Terraform is IaC software that allows teams to define, maintain, and automate configuration of the infrastructure that runs applications. It helps address outage issues by automating the process of reverting an application to a previous version, should a production outage occur.
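To make the Terraform side of this concrete, here is a minimal sketch of what a versioned infrastructure definition looks like. The provider, region, AMI ID, and resource names below are hypothetical placeholders for illustration, not details from the incident described earlier:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # hypothetical region
}

# A single application server, described declaratively.
# Because this file lives in version control, reverting the
# infrastructure means checking out an earlier commit and
# running `terraform apply` again, instead of retracing an
# undocumented setup sequence from memory.
resource "aws_instance" "app_server" {
  ami           = "ami-0123456789abcdef0" # hypothetical image ID
  instance_type = "t3.micro"

  tags = {
    Name = "financial-services-app"
  }
}
```

Running `terraform plan` previews every change before anything is touched, and `terraform apply` converges the environment to the declared state, which is exactly the repeatability the manual rebuild in the earlier story lacked.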
While talk of infrastructure might seem applicable only to operations, developers also must leverage IaC and containers to provide stable and secure development and testing environments for the delivery process. In addition, automating infrastructure is itself software development, so developers are typically involved in IaC activities.
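As an illustration of the container side, a minimal Dockerfile, with illustrative image names and paths rather than a prescribed setup, pins the exact runtime a service is built and tested against, so every developer machine and every CI run uses the same environment:

```dockerfile
# Pin an exact base image so development, testing, and CI
# all run on an identical runtime (image tag is illustrative).
FROM eclipse-temurin:17-jre

WORKDIR /app

# Copy the built application artifact (path is hypothetical).
COPY build/libs/app.jar app.jar

EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```

Because the environment itself is now a versioned file, "it works on my machine" disputes between developers and operations largely disappear.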
Look no further than application security for the important role developers must play in creating infrastructure. Besides assuring the security of applications through coding and testing, developers must make sure the infrastructure automated for delivery is secure as well.
As explained in “Shifting Security Left,” a downloadable eGuide on this website, developers must craft security into applications and DevOps environments from the beginning, rather than late in the process, in order to “ensure flaws and weaknesses are exposed early on through monitoring, assessment, and analysis, so remediation can be implemented far earlier than traditional efforts.” Once security issues are identified, the entire DevOps team should assume responsibility for resolving them.
Resolving DevOps Barriers
While I am a believer in IaC, I won’t ignore the legitimate concerns that may keep companies from committing to it. Automation is expensive, especially for legacy systems or those with hundreds of servers. Yet consider the costs of recurring but preventable outages requiring system rebuilds. That reason alone should justify an investment in infrastructure as code.
Another challenge is complexity. IaC requires extensive, well-defined planning, design strategies, and input from all team members. I recommend the use of an automation architect who is adept at writing scalable infrastructure code and possesses the skills to understand the key components of the IaC architecture.
But the presence of an automation architect does not eliminate the need for extensive training of the entire team on IaC. People have to learn before their machines do, so train your teams first!
Effective and efficient delivery automation won’t come to fruition without expertise, collaboration, silo-free communication, and agile teamwork.