Paper Review: Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages
This paper presents a cloud outage study of 32 popular Internet services, analyzing outage duration, root causes, impacts, and fix procedures. The paper appeared in SOCC 2016, and the authors are Gunawi, Hao, Suminto, Laksono, Satria, Adityatama, and Eliazar.
Availability is clearly very important for cloud services. Downtimes cause financial and reputational damage. As our reliance on cloud services increases, loss of availability creates even more significant problems. Yet several outages occur in cloud services every year. The paper tries to answer why outages still take place even with pervasive redundancies.
To answer that big question, here are the more focused questions the paper answers first.
- How many services do not achieve 99% (or 99.9%) availability?
- Do outages happen more in mature or young services?
- What are the common root causes that plague a wide range of service deployments?
- What are the common lessons that can be gained from diverse outages?
Before reading on, here were my guesses about the answers:
- 99% availability is easily achievable; 99.9% availability is still difficult for many services.
- Young services would have more outages than mature, older services.
- The most common root causes would be configuration- and update-related problems.
- "KISS: keep it simple, stupid" would be a common lesson.
Methodology of the paper
The paper surveys 597 outages from 32 popular cloud services. Wow, that is impressive! One would think the authors must be very well connected to teams in the industry to perform such an extensive survey. It turns out they just used Google search. They identified 32 popular cloud services (see Table 1), and then googled "service_name outage month year" for every month between January 2009 and December 2015. They went through the first 30 search hits and gathered 1247 unique links that describe 597 outages. They then systematically went through those post-mortem reports. Clever!
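To make the methodology concrete, here is a minimal Python sketch of how those monthly query strings could be generated. This is my own illustration, not the authors' tooling; the service names are placeholders.

```python
# A hypothetical sketch of the query-generation step described above
# (not the authors' actual scripts). One "service outage month year"
# query is produced per service per month, January 2009 to December 2015.
from datetime import date

services = ["ec2", "gmail", "dropbox"]  # placeholders for the 32 services in Table 1

def monthly_queries(service, start=date(2009, 1, 1), end=date(2015, 12, 1)):
    """Yield one 'service outage month year' query per month in the range."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{service} outage {date(year, month, 1):%B} {year}"
        month += 1
        if month > 12:
            month, year = 1, year + 1

for service in services:
    for query in monthly_queries(service):
        pass  # feed each query to a search engine and keep the first 30 hits
```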
The paper says that this survey was possible "thanks to the era of providers' transparency". But this also constitutes the caveat of their approach: the results are only as good as the providers' transparency allows. First, the dataset is not complete; not all outages are reported publicly. The paper defines "service outage" as an unavailability of full or partial features of the service that impacts all or a significant number of users in such a way that the outage is reported publicly. Second, there is a skew in the dataset: the more popular a service is, the more attention its outages will gather. Third, outage classifications are incomplete due to lack of information. For example, only 40% of outage descriptions disclose root causes and only 24% disclose fix procedures. (These ratios are disappointingly low.) And lastly, root causes are sometimes described vaguely in the postmortem reports. "Due to a configuration problem" can mean software bugs corrupting the configuration or operators setting a wrong configuration. In such cases, the paper chooses tags based on the information reported and uses the CONFIG tag, not the BUGS or HUMAN tags.
In order not to discredit any service, the paper anonymizes the service names as a category type followed by a number. (It is left as a fun exercise to the reader to de-anonymize the service names. :-)
Availability
If we consider only the worst year from each service, 10 services (31%) do not achieve 99% uptime and 27 services (84%) do not achieve 99.9% uptime. In other words, five-nines uptime (about five minutes of annual downtime) is still far from reach.
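As a quick sanity check on these numbers, here is the downtime arithmetic behind the nines. This is my own back-of-the-envelope calculation, not a table from the paper.

```python
# Annual downtime budgets implied by each availability level
# (my own arithmetic, for context on the uptime figures above).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("two nines", 0.99), ("three nines", 0.999),
                            ("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:>11} ({availability:.3%}): {downtime_min:8.1f} min/year of downtime")

# Two nines allow ~3.7 days of downtime per year, three nines ~8.8 hours,
# and five nines only ~5.3 minutes -- which is why five-nines uptime
# remains out of reach for most of the surveyed services.
```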
Regarding the question "does service maturity help?", I got this wrong. I had guessed that young services would have more outages than mature services. But it turns out the outage numbers from young services are relatively small. Overall, the survey shows that outages can happen in any service regardless of its maturity. This is because services do not stay the same as they mature. They evolve and grow with each passing year. They handle more users, and complexity increases with the added features. In fact, as discussed in the root causes section, every root cause can occur in large popular services in almost every year. As services evolve and grow, similar problems from the past might reappear in new forms.
Root causes
The “Cnt” column in Table 3 shows that 355 outages (out of the total 597) have UNKNOWN root causes. Among the outages with reported root causes, UPGRADE, NETWORK, and BUGS are the three most popular root causes, followed by CONFIG and LOAD. I had predicted the most common root causes would be configuration- and update-related, and I was right about that.
The BUGS label is used to tag reports that explicitly mention “bugs” or “software errors”. The UPGRADE label implies hardware upgrades or software updates. LOAD denotes unexpected traffic overloads that lead to outages. CROSS labels outages caused by disruptions from other services. POWER denotes failures due to power outages, which account for 6% of outages in the study. Last but not least, external and natural disasters (NATDIS), such as lightning strikes, vehicles crashing into utility poles, government construction work cutting two optical cables, and similar under-water fibre cable cuts, cover 3% of service outages in the study.
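To illustrate what this kind of tagging amounts to, here is a toy keyword-based tagger. It is my own sketch with hypothetical keyword lists, not the paper's classification procedure.

```python
# A simplified, hypothetical tagger in the spirit of the labels described
# above; real post-mortems need human judgment, as the paper notes.
TAG_KEYWORDS = {
    "BUGS":    ["bug", "software error"],
    "UPGRADE": ["upgrade", "update", "rollout"],
    "CONFIG":  ["configuration", "misconfigur"],
    "LOAD":    ["traffic surge", "overload"],
    "POWER":   ["power outage", "generator"],
    "NATDIS":  ["lightning", "fibre cut", "fiber cut"],
}

def tag_report(description: str) -> list[str]:
    """Return the root-cause tags whose keywords appear in a post-mortem text."""
    text = description.lower()
    tags = [tag for tag, words in TAG_KEYWORDS.items()
            if any(word in text for word in words)]
    return tags or ["UNKNOWN"]  # most reports in the study disclose no cause

print(tag_report("A configuration problem during a software update caused errors."))
# -> ['UPGRADE', 'CONFIG']
```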
The paper mentions that UPGRADE failures need more research attention. I think the Facebook "Configerator: Holistic Configuration Management" paper is a very relevant effort to address UPGRADE and CONFIG failures.
Single point of failure (SPOF)?
While component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER failures are anticipated and hence guarded against with extra redundancies, how come their failures still lead to outages? Is there some other "hidden" single point of failure? The paper answers this paradox as follows: "We find that the No-SPOF principle is not simply about redundancies, but also about the perfection of the failure recovery chain: complete failure detection, flawless failover code, and working backup components. Although this recovery chain sounds straightforward, we observe numerous outages caused by an imperfection in one of the steps. We find cases of missing or incorrect failure detection that do not activate failover mechanisms, buggy failover code that cannot transfer control to backup systems, and cascading bugs and coincidental multiple failures that cause backup systems to also fail."
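The quoted recovery chain can be made concrete with a toy model. This is my own illustration, not code from the paper: redundancy only helps if detection, failover, and the backup itself all work.

```python
# A toy failure-recovery chain: a primary, a backup, a detector, and
# failover logic. Any weak link leaves a hidden single point of failure.
# This illustrates the quoted argument; it is not code from the paper.
class Replica:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def serve(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"

def handle(request, primary, backup, detector_works=True, failover_works=True):
    try:
        return primary.serve(request)
    except RuntimeError:
        if not detector_works:        # missing/incorrect failure detection
            raise
        if not failover_works:        # buggy failover code
            raise RuntimeError("failover code crashed")
        return backup.serve(request)  # the backup itself may also be down

primary = Replica("primary", healthy=False)
backup = Replica("backup")
print(handle("req-1", primary, backup))   # redundancy saves the request here
# Flip detector_works, failover_works, or backup.healthy to see how one
# imperfect step in the chain turns redundancy into an outage.
```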
While the paper does not mention them, I believe the following works are very relevant for addressing the No-SPOF problem. The first is the crash-only software idea, which I had reviewed before: "Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery. Since failure-handling and normal startup use the same methods, this can increase the chance that bugs in failure-handling code will be noticed." The second line of work is on recovery blocks and n-version software. While these are old ideas, they should still be applicable to modern cloud services. Especially with the current trend of deploying microservices, micro-reboots (advocated by crash-only software) and n-version redundancy can see more applications.
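Here is a minimal sketch of the crash-only/micro-reboot idea described above. It is my own illustration, not code from the cited work: failure handling is just re-running the normal startup path.

```python
# Crash-only style in miniature (hypothetical example): a failed worker is
# restarted through the same startup path used on every normal boot,
# instead of going through bespoke recovery code.
import random

def start_worker():
    """Normal startup path; also the only 'recovery' path."""
    return {"state": "fresh", "requests_served": 0}

def serve(worker):
    if random.random() < 0.2:            # simulate an arbitrary crash
        raise RuntimeError("worker crashed")
    worker["requests_served"] += 1

worker = start_worker()
for _ in range(20):
    try:
        serve(worker)
    except RuntimeError:
        worker = start_worker()          # micro-reboot: no special-case recovery
print("served", worker["requests_served"], "requests since the last restart")
```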
Figure 5 breaks down the root-cause impacts into six categories: full outages (59%), failures of essential operations (22%), performance glitches (14%), data loss (2%), data staleness/inconsistencies (1%), and security attacks/breaches (1%). Figure 5a shows the number of outages categorized by root causes and implications.
Only 24% of outage descriptions disclose the fix procedures. Figure 5b breaks down the reported fix procedures into eight categories: add additional resources (10%), fix hardware (22%), fix software (22%), fix misconfiguration (7%), restart affected components (4%), restore data (14%), rollback software (8%), and "nothing" due to cross-dependencies (12%).
Conclusions
The take-home message from the paper is that outages happen because software is a SPOF. This is not a new message, but the paper's contribution is to validate and restate it for cloud services. On a personal note, I am fascinated with "failures". Failures are the spice of distributed systems. Distributed systems would not be as interesting and challenging without them. For example, without crashed nodes, without loss of synchrony, and without lost messages, the consensus problem is trivial. On the other hand, with any of those failures, it becomes impossible to solve the consensus problem (i.e., to satisfy both safety and liveness specifications), as the attacking generals and FLP impossibility results prove.
Related links
Jim Hamilton (VP and Distinguished Engineer at Amazon Web Services) is also fascinated with failures. In his excellent blog Perspectives, he provided a detailed analysis of the tragic end of the Italian cruise ship Costa Concordia. (His paper titled "On Designing and Deploying Internet-Scale Services" is also a must read.) Finally, here is a video of his talk "Failures at Scale and How to Ignore Them". Here is an earlier post from me about failures, resilience, and beyond: "Antifragility from an engineering perspective".
Here is Dan Luu's summary of Notes on Google's Site Reliability Engineering book.
Finally, this is a paper worth reviewing as a future blog post: How Complex Systems Fail.