Last Updated: 2010-12-30 20:17:27 UTC
by Rick Wanner (Version: 1)
Abel Avram has posted an interesting analysis of the causes of, and solutions to, the December 22nd Skype outage that affected millions of users.
In short, the outage was caused by a bug in the undelivered message code. This bug had been fixed in a subsequent version, but 50% of Skype users were still using the buggy version. Because Skype is a peer-to-peer application, the crash of 40% of Skype clients when the undelivered messages attempted delivery put undue strain on the remaining Skype users' machines. These clients then left the Skype network to protect themselves, causing a cascading network failure.
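To make the cascade concrete, here is a toy sketch of the dynamic described above. It is not a model of Skype's real architecture: the function name, the node capacities, and the load figure are all made-up assumptions, and it assumes the total load is split evenly among whichever nodes are still online.

```python
def simulate_cascade(capacities, total_load, crashed):
    """Toy cascade model (illustrative only; all figures are hypothetical).

    capacities: the load each node can carry before it withdraws.
    total_load: the load shared evenly among all online nodes.
    crashed:    set of node indices that fail at the start.
    Returns the number of online nodes after each round.
    """
    online = [c for i, c in enumerate(capacities) if i not in crashed]
    history = [len(online)]
    while online:
        share = total_load / len(online)
        # Nodes asked to carry more than they can handle leave the network.
        survivors = [c for c in online if c >= share]
        if len(survivors) == len(online):
            break  # everyone can cope, so the network is stable
        online = survivors
        history.append(len(online))
    return history


# Hypothetical network: ten nodes sharing a total load of 100 units.
capacities = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

print(simulate_cascade(capacities, 100, set()))         # [10]
print(simulate_cascade(capacities, 100, {6, 7, 8, 9}))  # [6, 2, 0]
```

With every node healthy, the network is stable. Once four of the ten nodes crash, the survivors' share of the load rises, the weaker ones drop out, the share rises again, and within two rounds nothing is left.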
Most interesting are the lessons, which in retrospect seem a little obvious:
- "One important lesson to be learned is this: many users do not update their software if they don’t have to...". Apparently Skype is considering a Google Chrome-style invisible update.
- "Skype deciding to review their “testing processes to determine better ways of detecting and avoiding bugs which could affect the system.”"
- “will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems.”
Patching, testing, and adequate capacity. Aren't these pretty much the cornerstones of effective IT?
-- Rick Wanner - rwanner at isc dot sans dot org - http://namedeplume.blogspot.com/ - Twitter:namedeplume (Protected)