Moving IT Operations into Fire Prevention Mode

IT operations teams are always in a tough spot. When they do their job and systems run smoothly, few acknowledge their achievement. However, everyone remembers the phone number of IT when a business system fails. Still, when IT gurus track down a root cause of an issue and it is resolved, there is a brief moment of glory when their skills are recognized. The problem is that when you chase too many fires with too few resources, someone is going to get burned. One of the biggest challenges of IT today is changing the mentality of IT operations and shifting from focusing on fighting fires to preventing the fires.

There’s a wave of new technologies that help proactively avoid issues that impact stability, performance, and security, with IT operations analytics being one of them. However, these technologies cannot succeed if IT teams are recognized and compensated only for their ability to diagnose and resolve issues quickly. IT operations organizations simply have no time to improve efficiency and agility if they are busy dealing with a constant stream of help desk requests.

Running After Problems: IT Firefighting

One of the main goals for IT organizations is to build and maintain environments that offer the highest possible availability and best performance. In the past, IT limited the number of changes happening in their environments. Eventually, optimal performance and availability in the controlled environment were reached, and reported incidents were fixed continuously. Yet as new technologies and systems were introduced, environments evolved, building on the foundation of existing layers of infrastructure and software.

The most appreciated operations specialists were those who intimately understood the architecture and configuration of these historical layers. They accumulated tremendous knowledge of past issues and were able to quickly investigate incidents by relying on their experience and instincts. IT management depended on them to maintain service-level agreements and keep the business running.

It’s easy to see evidence of the IT practice of firefighting. Just try scheduling a meeting with an IT operations specialist. Three out of four times the meeting will be rescheduled because the specialist is busy dealing with yet another production issue. The other downside to the firefighting approach is that the entire IT planning effort essentially just budgets for issues. From the outset, IT management expects that any change will require time and resources to get the environment to be stable again. Effort will be spent on incident investigation and resolution, causing new projects to take longer than expected. Delays are tolerated because performance and availability issues are just assumed to impact virtually any major project.

Why Fighting Fires Is Now Impossible

IT operations teams face increasing complexity in their environments, struggling to move beyond the perpetual firefighting mode. Complex systems present more elements to manage and more data to sort through. This complexity is typified by what has been called IT’s “big data problem,” comprising billions of machine events, environment and application changes, performance and availability metrics, and vast amounts of other structured and unstructured IT operations data from a wide variety of sources that go mostly unprocessed.

Making an already difficult management problem even harder, IT operations has to deal with a constant flow of changes. Application updates used to be at most a monthly occurrence followed by a few weeks of stabilization in production; now, accelerated application and software deployment schedules are driving a much faster pace of change. Agile development processes and practices such as continuous integration and continuous build have pushed the number of changes up, making it extremely difficult to keep IT environments stable and performing well, and raising the risk of errors that can spark incidents.

Furthermore, as organizations are forced to cut costs, including reducing headcount, IT is under pressure to keep up a hectic pace with even fewer resources. Big cost reductions are evident across large organizations. For example, just in the last few years a number of major banks have made reductions in manpower, which represents the lion’s share of any IT budget, to make their business more cost-effective. IT operations has to cope with the fact that the number of available firefighters is dropping, but the number of fires is growing. This imbalance cannot continue forever.

Why Past Approaches Fell Short

Industry best practices and processes don’t address this problem and often even contribute to the firefighting state of IT. IT teams relying on traditional change management processes can’t keep pace with the frequency of critical changes required to support environment operations. Established IT processes become more bureaucratic and slow tasks down. I work with organizations with very mature change processes. All have some service desk in place where requests are initiated and where the whole workflow takes place. The problem is that the end result of any process, mature or immature, is an actual change going into production and affecting the stability of the system. And manual processes do not provide sufficient support to handle that change reliably and effectively.

Furthermore, traditional tools were not designed to deal with IT’s big data problem. With the complexity of IT systems, the dynamics of IT operations, and multiple teams working in silos, IT operations needs not only to automate, but also to collect data down to the finest details. None of the traditional tools has actually been capable of this. They have not approached this state as a “big data” problem; instead, they provide IT operations with lots of raw data that lacks insight or actionable information.

Switching to Fire Prevention Mode

Continuing to manage highly complex IT environments in a reactive mode leaves IT specialists vulnerable, when really they need to understand the actual causes and effects of what’s happening among the many technologies in use across the enterprise. The fact is that some IT firefighting is simply unavoidable. However, what really matters is what we can do to be better prepared to prevent incidents from happening at all. This means we need to limit IT fires to very occasional events rather than take an attitude of “That’s just how we do things here.” We need to apply proactive thinking to operations.

This means adopting more reliable processes and encouraging IT operations to be aware of what happens as a result of a change early enough to deal with the results easily and be better prepared to prevent undesired outcomes. This will give IT operations the breathing room to be able to step back and take a more comprehensive approach to IT management, breaking the reactive cycle.

Applying automation tools to IT operations can provide teams with more time to focus on innovation and deployment of new applications. However, it is still just the first step in helping IT operations to better carry out their tasks. To become a productive, lean organization, IT operations needs to be able to look at preventing issues rather than wasting time fighting fires. One way to achieve this is to take an analytics approach to IT management.

Applying IT Operations Analytics

IT operations analytics tools take a fresh perspective on the massive amounts of operations data and environment complexity confronting operations teams. They automatically generate actionable insights that current tools don’t offer, supporting day-to-day operations decisions. For example, analytics help answer questions such as whether a change was too risky, whether a deployment completed successfully, and whether environments are consistent.
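To make the "are environments consistent" question concrete, here is a minimal sketch in Python. It assumes each environment's configuration has already been collected into a flat dictionary; the function name find_config_drift, the setting names, and the values are all hypothetical and chosen only for illustration. Real analytics tools work on far richer data than this.

```python
from typing import Any, Dict, List, Tuple

# Placeholder shown when a setting exists in one environment but not the other.
MISSING = "<missing>"


def find_config_drift(
    baseline: Dict[str, Any], target: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (setting, baseline_value, target_value) for every setting that differs."""
    drift = []
    # Walk the union of keys so settings present on only one side are reported too.
    for key in sorted(set(baseline) | set(target)):
        base_value = baseline.get(key, MISSING)
        target_value = target.get(key, MISSING)
        if base_value != target_value:
            drift.append((key, base_value, target_value))
    return drift


# Hypothetical snapshots of two environments (illustrative values only).
staging = {"jvm.heap": "4g", "db.pool.size": 20, "tls.version": "1.2"}
production = {"jvm.heap": "2g", "db.pool.size": 20, "cache.ttl": 300}

for key, staging_value, production_value in find_config_drift(staging, production):
    print(f"{key}: staging={staging_value} production={production_value}")
```

Running a comparison like this before and after each release is one simple way to notice drift before it turns into an incident.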

IT operations analytics can help you put processes and information flow in place so you can:

  • See risks to services across silos and mitigate them before they escalate and impact users
  • See the root cause of impact to services across silos so you can eliminate wasteful use of cross-functional triage teams and speed mean time to repair of service issues
  • Reduce the overall cost of service delivery through less time spent fixing and repairing services, less firefighting, and less disruption of IT projects and processes

With change requests and changes now coming at IT at a blinding pace, instead of reverse engineering a problem’s root cause from low-level machine events and metrics, IT operations teams can apply IT operations analytics tools to carry out a top-down analysis. This blends and reviews all the diverse IT operations data via the changes that occur, allowing for effective prevention of issues.
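One way to picture this change-centric, top-down analysis is the sketch below. It assumes that changes are already recorded with a timestamp, and it simply lists the changes applied in a window before an incident, most recent first, as candidates for investigation. The Change record, the changes_before_incident function, the 24-hour window, and all sample data are hypothetical; this is an illustration of the idea, not how any particular analytics product works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Change:
    """A recorded change to a system (hypothetical record layout)."""
    system: str
    description: str
    applied_at: datetime


def changes_before_incident(
    changes: List[Change],
    incident_at: datetime,
    window: timedelta = timedelta(hours=24),
) -> List[Change]:
    """Return changes applied within `window` before the incident, newest first."""
    window_start = incident_at - window
    candidates = [c for c in changes if window_start <= c.applied_at <= incident_at]
    return sorted(candidates, key=lambda c: c.applied_at, reverse=True)


# Made-up change history for illustration.
changes = [
    Change("web-01", "application release deployed", datetime(2024, 3, 4, 22, 15)),
    Change("db-02", "connection pool size lowered", datetime(2024, 3, 5, 8, 40)),
    Change("web-01", "OS patch applied", datetime(2024, 3, 1, 2, 0)),
    Change("lb-01", "routing rule updated", datetime(2024, 3, 5, 10, 0)),
]
incident_at = datetime(2024, 3, 5, 9, 5)

for change in changes_before_incident(changes, incident_at):
    print(f"{change.applied_at:%Y-%m-%d %H:%M} {change.system}: {change.description}")
```

Starting from the short list of recent changes, rather than from raw machine events, keeps the investigation focused; the window length is an assumption that would need tuning for each environment.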

CMCrossroads is a TechWell community.
