Cloud Fault-Tolerance
I had submitted this paper to SSS'17. But it got rejected. So I made it a technical report and am sharing it here. While the paper is titled "Does the cloud need stabilizing?", it is relevant to the more general cloud fault-tolerance topic.
I think we wrote the paper clearly. Maybe too clearly.
Here is the link to our paper: https://www.cse.buffalo.edu//tech-reports/2017-02.pdf
(Ideally I would have liked to expand on Section 4. I think that is what we will work on now.)
Below is an excerpt from the introduction of our paper, if you want to skim that before downloading the pdf.
-------------------------------------------------------------------------
The last decade has witnessed rapid proliferation of cloud computing. Internet-scale webservices have been developed providing search services over billions of webpages (such as Google and Bing), and providing social network applications to billions of users (such as Facebook and Twitter). While even the smallest distributed programs (with 3-5 actions) can produce many unanticipated error cases due to the concurrency involved, it seems short of a miracle that these web-services are able to operate at those vast scales. These services have their share of occasional mishaps and downtimes, but overall they hold up really well.
In this paper, we try to answer what factors contribute most to the high-availability of cloud computing services, what type of fault-tolerance and recovery mechanisms are employed by the cloud computing systems, and whether self-stabilization fits anywhere in that picture.
(Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified manner instead of on a case by case basis: Instead of trying to figure out how much faults can disrupt the system's operation, stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization then advocates designing recovery actions that take the program back to invariant states starting from any arbitrary state.)
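To make the definition concrete, here is a minimal sketch of a classic textbook stabilizing program, Dijkstra's K-state token ring. It is not taken from our paper, and the function names are illustrative assumptions. The invariant is "exactly one process holds the privilege (token)". The simulation starts from an arbitrarily corrupted state and lets a randomly chosen privileged process move at each step until the invariant holds again.

```python
import random

def privileged(x, i):
    """Process 0 is privileged when it equals its predecessor (the last
    process); every other process is privileged when it differs from its
    predecessor."""
    if i == 0:
        return x[0] == x[-1]
    return x[i] != x[i - 1]

def step(x, i, k):
    """Execute the single action of privileged process i."""
    if i == 0:
        x[0] = (x[0] + 1) % k   # process 0 generates a fresh value
    else:
        x[i] = x[i - 1]         # the others copy their predecessor

def stabilize(x, k, rng, max_steps=10_000):
    """Run from an arbitrary state until exactly one process is privileged.
    Returns the number of steps taken. Assumes k >= len(x)."""
    for steps in range(max_steps):
        enabled = [i for i in range(len(x)) if privileged(x, i)]
        if len(enabled) == 1:   # invariant restored
            return steps
        step(x, rng.choice(enabled), k)
    raise RuntimeError("did not converge within max_steps")

# Usage: an arbitrary (corrupted) starting state with several privileges.
rng = random.Random(0)
state = [3, 0, 4, 4, 1]
steps_taken = stabilize(state, k=5, rng=rng)
print("stabilized after", steps_taken, "steps:", state)
```

Note that the program never asks which fault produced the bad state. The same actions that circulate the token in normal operation also repair the ring.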
Self-stabilization had shown a lot of promise early on for being applicable in the cloud computing domain. The CAP theorem seemed to motivate the need for designing eventually-consistent systems for the cloud, and self-stabilization has been pointed out by experts as a promising direction towards developing a cloud computing research agenda. On the other hand, there have not been many examples of stabilization in the clouds. For the last six years, the first author has been thinking about writing a "Stabilization in the Clouds" position paper, or even a survey when he thought there would surely be plenty of stabilizing design examples in the cloud. However, this proved to be a tricky undertaking. Discounting the design of eventually-consistent key-value stores and the application of Conflict-free Replicated Data Types (CRDTs) for replication within key-value stores, the examples of self-stabilization in the cloud computing domain have been overwhelmingly trivial.
We ascribe the reason self-stabilization has not been prominent in the cloud to our observation that cloud computing systems use infrastructure support to keep things simple and reduce the need for sophisticated design of fault-tolerance mechanisms. In particular, we identify the following cloud design principles to be the most important factors contributing to the high-availability of cloud services.
- Keep the services "stateless" to avoid state corruption. By leveraging distributed stores for maintaining application data and ZooKeeper for distributed coordination, the cloud computing systems keep the computing nodes almost stateless. Due to the abundance of storage nodes, the key-value stores and databases replicate the data multiple times and achieve high-availability and fault-tolerance.
- Design loosely coupled distributed services where nodes are dispensable/substitutable. The service-oriented architecture and the RESTful APIs for composing microservices are very prevalent design patterns for cloud computing systems, and they help facilitate the design of loosely-coupled distributed services. This minimizes the footprint and complexity of the global invariants maintained across nodes in the cloud computing systems. Finally, the virtual computing abstractions, such as virtual machines, containers, and lambda computing servers, help make computing nodes easily restartable and substitutable for each other.
- Leverage low-level infrastructure and sharding when building applications. The low-level cloud computing infrastructure often contains more interesting/critical invariants, and thus it is designed by experienced engineers, tested rigorously, and sometimes even formally verified. Higher-level applications leverage the low-level infrastructure, and avoid complicated invariants as they resort to sharding at the object-level and user-level. Sharding reduces the atomicity of updates, but this level of atomicity has been adequate for most webservices, such as social networks.
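The first and third principles can be sketched together in a few lines. This is only an illustration under stated assumptions: `ShardedStore` is a hypothetical in-memory stand-in for a replicated key-value store, `StatelessWorker` and `handle_like` are made-up names, and a real store would also replicate each shard.

```python
import hashlib

class ShardedStore:
    """Hypothetical in-memory stand-in for a sharded key-value store."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key):
        # Hash the key so that each user maps to exactly one shard.
        digest = hashlib.sha256(key.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def get(self, key, default=0):
        return self._shard(key).get(key, default)

    def put(self, key, value):
        self._shard(key)[key] = value

class StatelessWorker:
    """Keeps no application state of its own, so any instance can serve
    any request and a crashed instance can simply be replaced."""

    def __init__(self, store):
        self.store = store

    def handle_like(self, user):
        # The update touches a single user's key, hence a single shard.
        count = self.store.get(user) + 1
        self.store.put(user, count)
        return count

# Usage: worker_a "crashes" and worker_b takes over without any recovery step.
store = ShardedStore(num_shards=4)
worker_a = StatelessWorker(store)
worker_a.handle_like("alice")
del worker_a
worker_b = StatelessWorker(store)
print(worker_b.handle_like("alice"))
```

Because the only invariant here is per user, there is nothing global for a fault to corrupt, which is the sense in which such services are trivially "stabilizing".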
A common theme among these principles is that they keep the services simple, and trivially "stabilizing", in the informal sense of the term. Does this mean that self-stabilization research is unwarranted for cloud computing systems? To answer this, we point to some silver lining in the clouds for stabilization research. We notice a trend that even at the application-level, the distributed systems software starts to get more complicated/convoluted as services with more ambitious coordination needs are being built.
In particular, we explore the opportunity of applying self-stabilization to tame the complications that arise when composing multiple microservices to provide higher-level services. This is getting more common with the increased consumer demand for higher-level and more sophisticated web services. The higher-level services are in effect implementing distributed transactions over the federated microservices from multiple geodistributed vendors/parties, and that makes them prone to state unsynchronization and corruption due to incomplete/failed requests at some microservices. At the data processing systems level, we also highlight a need for self-regulating and self-stabilizing design for realtime stream processing systems, as these systems get more ambitious and complicated as well.
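As a rough illustration of what stabilization could look like for composed microservices, consider a higher-level booking service built over several vendor services. The sketch below is an assumption for illustration, not the design from the paper: each service is modeled as the set of order ids it currently holds, and the invariant is that an order is held by every service or by none. A periodic reconciler restores the invariant from any state by releasing partially completed orders.

```python
def violating_orders(services):
    """Return the order ids held by some, but not all, of the services.
    `services` maps a (hypothetical) service name to the set of order ids
    it currently holds a reservation for."""
    all_ids = set().union(*services.values())
    return {
        oid for oid in all_ids
        if not all(oid in held for held in services.values())
    }

def reconcile(services):
    """Compensate every partially completed order by releasing it
    everywhere. Returns the set of order ids that were released."""
    orphans = violating_orders(services)
    for held in services.values():
        held -= orphans   # in-place removal of the orphaned reservations
    return orphans

# Usage: order 1 completed everywhere; orders 2 and 3 failed midway.
services = {
    "flight": {1, 2},
    "hotel": {1, 3},
    "car": {1},
}
released = reconcile(services)
print("released:", released, "state:", services)
```

The reconciler does not need to know which request failed or why. It only checks the invariant and corrects toward it, which is the stabilization mindset applied at the service-composition level.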
Finally, we point out a rift in the cloud computing fault model and recovery techniques, which motivates the need for more sophisticated recovery techniques. Traditionally the cloud computing model adopted the crash failure model, and managed to confine the faults within this model. In the cloud, it was feasible to use multiple nodes to redundantly store state, and easily substitute a stateless worker with another one as nodes are abundant and dispensable. However, recent surveys on the topic remark that more complex faults are starting to prevail in the clouds, and recovery techniques of restart, checkpoint-reset, and devops-involved rollback and recovery are becoming inadequate.