Most incidents end with mitigation actions and follow-up fixes applied to the system, all focused on addressing the root cause. Do these actions mean the organization, the team, or the system has learned from the issue?
No.
If the root cause is addressed and a few more tests are added to ensure the system won’t fail for the same reason in the future, the team has improved, but is still missing crucial learning.
The team needs to identify other areas with similar weak points, share the knowledge with the rest of the organization, and expand the knowledge base. This work is never prioritized in the rush of feature and product delivery. The gap is also invisible when a similar error later triggers an incident, because people don’t look back at prior incidents to connect them. Those who might remember the previous one have often left the company, and the team is focused on what it has at hand.
Unless everyone has learned and all similar issues have been fixed, we can’t say that we have maximized learning from software failures.
- Related Note(s):
- Effectively learning from mistakes requires clear intention and follow-ups, not just mechanisms.
- Great debugging pursues ‘the why’; finding similar weak points requires the same mindset.
- Sharing learning publicly builds organizational resilience and accountability culture.
- An open feedback culture eliminates recurring mistakes.
- Coaching for responsibility: look holistically at how multiple factors contribute to problems.