99: Not Learning from Software Failures

Most incidents end with mitigation actions and follow-up fixes applied to the system. These focus on addressing the root cause. Do these actions mean the organization, the team, or the system has learned from the issue?

No.

If the root cause is addressed and a few more tests are added to ensure the system won’t fail for the same reason in the future, the team has improved, but is still missing crucial learning.

They need to identify other areas with similar weak points, share the knowledge with the organization, and expand the knowledge base. This is never prioritized in the rush of feature and product delivery. The gap also stays invisible when a similar error triggers a new incident, because people don’t look back at prior incidents to connect them. The people who might remember the previous one have often left the company, and the team is focused on what it has at hand.

Unless everyone has learned, unless all similar issues have been fixed, we can’t say that we have maximized learning from software failures.


This note is mentioned in:

84.

If you're unfamiliar with Zettelkasten: These notes are atomic. The aim is to have one idea in a note. The connections between notes are as important as the notes themselves.
