First of all, what a wild ride. It sounds like a really rough week for Cloudflare engineers, dealing with such a crazy set of circumstances.
More importantly though, I personally always really appreciate how detailed Cloudflare is in its post mortems. They explain what happened, why it happened, and what they are going to do to prevent the same issue from happening again. That’s not easy to do, both technically and in terms of swallowing your pride to admit mistakes, but they do it time and time again.
While this was clearly not caused by something that was their fault, they also acknowledge that this should not matter, because they could have done more to prevent this scenario. Taking ownership no matter the circumstances is extremely admirable.
Other companies could learn a lot from Cloudflare on how to transparently handle these scenarios. Okta specifically comes to mind…
If you haven’t read this yet, I recommend it. It’s pretty fascinating.
Post Mortem on Cloudflare Control Plane and Analytics Outage