Moving IT Operations into Fire Prevention Mode


Continuing to manage highly complex IT environments in a reactive mode leaves IT specialists vulnerable. What they really need is to understand the actual causes and effects of what’s happening among the many technologies in use across the enterprise. Instead of constantly fighting fires, IT operations teams should aim to prevent the fires from starting.

IT operations teams are always in a tough spot. When they do their job and systems run smoothly, few acknowledge their achievement. However, everyone remembers the phone number of IT when a business system fails. Still, when IT gurus track down the root cause of an issue and resolve it, there is a brief moment of glory when their skills are recognized. The problem is that when you chase too many fires with too few resources, someone is going to get burned. One of the biggest challenges in IT today is changing the mentality of IT operations, shifting the focus from fighting fires to preventing them.

There’s a wave of new technologies that help proactively avoid issues that impact stability, performance, and security, with IT operations analytics being one of them. However, these technologies cannot succeed if IT teams are not recognized or compensated for their ability to diagnose and resolve issues quickly. IT operations organizations simply have no time to improve efficiency and agility if they are busy dealing with a constant stream of help desk requests.

Running After Problems: IT Firefighting

One of the main goals for IT organizations is to build and maintain environments that offer the highest possible availability and best performance. In the past, IT limited the number of changes happening in its environments. Eventually, optimal performance and availability were reached in the controlled environment, and reported incidents were fixed continuously. Yet as new technologies and systems were introduced, environments evolved, building on the foundation of existing layers of infrastructure and software.

The most appreciated operations specialists were those who intimately understood the architecture and configuration of these historical layers. They accumulated tremendous knowledge of past issues and were able to quickly investigate incidents by relying on their experience and instincts. IT management depended on them to maintain service-level agreements and keep the business running.

It’s easy to see evidence of the IT practice of firefighting. Just try scheduling a meeting with an IT operations specialist. Three out of four times the meeting will be rescheduled because the specialist is busy dealing with yet another production issue. Another downside to the firefighting approach is that the entire IT planning effort essentially just budgets for issues. From the outset, IT management expects that any change will require time and resources to make the environment stable again. Effort will be spent on incident investigation and resolution, causing new projects to take longer than expected. Delays are tolerated because performance and availability issues are simply assumed to affect virtually any major project.

Why Fighting Fires Is Now Impossible

IT operations teams face increasing complexity in their environments and struggle to move beyond the perpetual firefighting mode. Complex systems present more elements to manage and more data to sort through. This complexity is typified by what has been called IT’s “big data problem”: billions of machine events, environment and application changes, performance and availability metrics, and vast amounts of other structured and unstructured IT operations data from a wide variety of sources, most of which goes unprocessed.

Making an already difficult management problem even harder, IT operations has to deal with a constant flow of changes. Application updates used to be at most a monthly occurrence, followed by a few weeks of stabilization in production. Now, accelerated application and software deployment schedules are driving a much higher pace of change. The implementation of agile development processes and the adoption of practices such as continuous integration and continuous build have pushed the number of changes up, making it extremely difficult to keep IT environments stable and performing well and raising the risk of errors that can spark incidents.

CMCrossroads is a TechWell community.