If you want secure, reliable systems, you need all stakeholders actively communicating. This means involving both IT operations and developers in discussions after deployments, to determine what went wrong and how it can be avoided, and what went well or could be refined. Integrating your postmortems and retrospectives facilitates collaboration and improves processes.
Developing and supporting complex systems can be a challenging endeavor. Configuration management best practices go a long way toward preventing mistakes, but things still happen. When there is an incident or service outage, the IT operations organization usually conducts a postmortem to ascertain what went wrong and how the problem can be avoided in the future.
In large organizations, this meeting typically has an operations focus. Too often, members of the development team are invited but choose not to participate, and this is a real loss for the organization. IT operations may have deep knowledge of how a system behaves in production, but the developers are the technology subject matter experts who wrote the application and know its inner requirements and other secrets. Effective organizations work to leverage the knowledge of both their operations teams and the development gurus.
Most professional organizations have a critical incident response team to manage the communication and effort to fix whatever problem has occurred. There are times when this is a very straightforward effort and the operations team can handle the outage and get the system back online very quickly. When the root cause is less than obvious, the problem management function takes over, ensuring that the right experts are working to help ascertain exactly what happened and what needs to be done to avoid similar problems in the future. If you observe the developers triaging the problem, you will gain valuable information about how the system really works. In practice, however, the teams involved often act in a siloed manner, which results in poor communication and a lack of collaboration.
In the agile methodology, there are also meetings conducted by the development team called retrospectives, which are held to discuss what went well and what could be improved, usually related to application deployments. The retrospective has a different rhythm and feel compared to the postmortem, but both have the purpose of improving our processes. The key to success is ensuring that your postmortems and your retrospectives are aligned to get maximum input from all the key stakeholders. The place to start is sharing knowledge.
Incidents and problems may be the result of human error, due to a lack of either procedures or automation to streamline the deployment process. The operations team knows how the application behaves in production, but it is the developers who know the architecture and technical stack and understand how the system was actually constructed.
If you want secure and reliable systems, you need to get all the stakeholders actively participating in sharing knowledge. Sometimes this requires changes to the code, often a bug fix. But sometimes there is a much simpler requirement to understand the underlying technical runtime dependencies.
The simple example is monitoring your disk, memory, and CPU usage, but there are plenty of other system resources that can impact your production environment too. The developers understand these dependencies but sometimes forget to communicate them to the operations team, and they document them even more rarely. The challenge is to harness your resources to get the information you need to ensure secure and reliable systems.
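One way developers might hand these dependencies to operations is as a small, checkable list of thresholds rather than tribal knowledge. The sketch below is only an illustration under stated assumptions: the threshold names and values, and the functions read_local and breaches, are hypothetical and not taken from any particular monitoring tool. It covers disk and CPU load using the Python standard library; memory is left out because reading it portably would need a third-party package.

```python
import os
import shutil

# Hypothetical thresholds that developers would document for operations.
THRESHOLDS = {"disk_used_pct": 90.0, "load_per_cpu": 1.5}


def read_local(path="/"):
    """Collect disk and CPU load readings using only the standard library."""
    usage = shutil.disk_usage(path)
    readings = {"disk_used_pct": 100.0 * usage.used / usage.total}
    try:
        # One-minute load average, normalized by CPU count (Unix only).
        one_minute_load = os.getloadavg()[0]
        readings["load_per_cpu"] = one_minute_load / (os.cpu_count() or 1)
    except (AttributeError, OSError):
        pass  # load average is unavailable on this platform
    return readings


def breaches(readings, thresholds=THRESHOLDS):
    """Return the names of readings that exceed their documented threshold."""
    return sorted(
        name
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    )


if __name__ == "__main__":
    print(breaches(read_local()))
```

Keeping the thresholds in one reviewed place gives both teams something concrete to revisit in a postmortem or retrospective when an outage traces back to an undocumented dependency.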
Many organizations adhere to the well-respected ITIL v3 framework’s practices around postmortems and incident and problem management. Some may even have environment and event monitoring. But few actually integrate their agile retrospectives with their operational practices, and this is a big loss. What organizations should be doing is automatically triggering an agile retrospective whenever there is a need to conduct a postmortem. These meetings share a common goal of improving process, but in practice they take very different approaches and draw different audiences, and each yields key information that the entire team needs to learn from.
The agile retrospective tends to be conducted after a major release, but there is no real reason it cannot be used even for minor outages, including bug fixes. The key is that the retrospective focuses on what went well and what can be improved. If you want to really benefit from these discussions, everyone on the team must feel safe giving their input, even if they are admitting that they made a mistake. The postmortem is a different approach, so both discussions need to happen and, ideally, should be integrated.
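To make the idea of automatically triggering a retrospective concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the Meeting and Scheduler classes, the open_postmortem method, and the team names are hypothetical and do not correspond to ITIL tooling or any real ticketing system. It only shows the rule itself: opening a postmortem for an incident also schedules a retrospective, with both operations and development invited to each.

```python
from dataclasses import dataclass, field


@dataclass
class Meeting:
    kind: str          # "postmortem" or "retrospective"
    incident_id: str
    invitees: list


@dataclass
class Scheduler:
    """Hypothetical scheduler that pairs every postmortem with a retrospective."""
    dev_team: list
    ops_team: list
    meetings: list = field(default_factory=list)

    def open_postmortem(self, incident_id):
        # Invite both groups so neither meeting happens in a silo.
        everyone = self.ops_team + self.dev_team
        created = [
            Meeting("postmortem", incident_id, list(everyone)),
            Meeting("retrospective", incident_id, list(everyone)),
        ]
        self.meetings.extend(created)
        return created


if __name__ == "__main__":
    scheduler = Scheduler(dev_team=["dev-lead"], ops_team=["ops-oncall"])
    for meeting in scheduler.open_postmortem("INC-1"):
        print(meeting.kind, meeting.invitees)
```

In a real organization the same rule would more likely live in the incident or ticketing workflow than in standalone code.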
Prior to a release or other change to the production environment, most organizations conduct a change control meeting. Too often these meetings are focused simply on the calendar and often fail to really assess and analyze technical risk. Your organization should benefit from the effective communication that comes from well-integrated meetings involving all the key stakeholders. You need to continuously assess and improve your own processes.
Many companies get mired in one way of doing things. Stakeholders often say they have a process that approaches these issues in a specific way, implying that is just the way they do things. If you want to be successful, your organization needs to be capable of modifying your procedures to achieve the best results. The best process improvement is agile, continuous, and constantly adaptive.
As you mention in your article, post mortems and agile retrospectives can complement each other for continuous improvement. Both serve a purpose and I see the benefits from doing them.
One thing though, you mention that retrospectives focus on what went well and what should be improved. This is one of the 100++ ways to do a retrospective, and often not the most valuable way to do them.
There's a Retrospective Exercises Toolbox which provides many ways to do them. Some actually use techniques similar to those of post mortems, e.g. five times why or value stream analysis.
Author Getting Value out of Agile Retrospectives
Thanks Ben - I agree fully that there are lots of ways in which folks conduct retrospectives and I appreciate your excellent insight. That said, there is a significant difference between postmortems as they are often handled by Ops and retrospectives as commonly implemented by agile teams. From a DevOps perspective, I am looking to enable the synergy between the two perspectives. Again, thanks for your comment!
Thanks Bob for your reaction on my comment.
I've done project post mortems, where we evaluated how the project went and proposed improvements on how future projects could be managed better. This is how I know post mortems; reading your reaction, I see they are also used in operations. In operations I have done Root Cause Analysis, for instance on major outages of IT systems or repeating problems.
Retrospectives are also done more and more on release or project level, next to teams doing them every sprint.
I like how you connect the two in DevOps. There is indeed synergy: these techniques help teams and organizations to improve continuously.
Each organization has its own culture and especially its own communication style. The key is to understand how each group performs their required activities and look for ways to improve communication and collaboration. I listen to the rhythm of the communication without any assumptions or preconceived notions.
I looked at your toolbox and the part about getting retrospective actions done is much appreciated. We stopped doing retrospectives for that reason and because management didn't want to hear all the negatives. I think at the time the decision was made it was OK because we were a small team and in the retrospective we did not uncover anything new. Now our team has not only tripled in size within a very short period, but has also been organized into five somewhat disconnected teams working on the same software. I will ask for retrospectives to be done again because now we would probably get more out of them than we did before... if we get the actionable items done.