Ladis 2013 Keynotes
I attended and presented a paper at LADIS 2013, which was colocated with SOSP13. I will talk about my paper in a later post. Here I just want to share brief summaries of the LADIS keynotes.
1. LADIS keynote: Cloud-scale Operational Excellence, by Peter Vosshall, distinguished engineer
What is operational excellence? It is anticipating and addressing problems.
For us, operational excellence arises from a combination of culture + tools + processes.
1.1 Culture
Amazon leadership principles are:
- Customer obsession
- Ownership (Amazon has a strong ownership culture, known as devops!)
- Insisting on the highest standards
1.2 Tools
Amazon has tools for software deployment, monitoring, visualization, ticketing, risk auditing.
In 1995, Amazon had a single web-server operation, and had a website-push perl script. This was managed by a small centralized team, named Houston.
The team invested in a tool called Apollo for automating deployments. As a result, it was easy to do deployments. Some 2011 numbers are as follows. Mean time between deployments: 11.6 secs, Max number of deployments in an hour: 1079, Mean number of hosts simultaneously receiving deployments: 10000.
Another tool for enabling continuous deployment is pipelines, which automate the path the code takes from check-in to production: packages -> version set -> beta -> 1box -> production.
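To make the pipeline idea concrete, here is a minimal sketch of stage-by-stage promotion. Only the stage names come from the talk; the `run_pipeline` function, the `checks` mapping, and the stop-at-first-failure behavior are hypothetical and are not a description of Amazon's actual tooling.

```python
# Hypothetical sketch: the stage names follow the talk, the rest is assumed.
STAGES = ["packages", "version set", "beta", "1box", "production"]


def run_pipeline(change, checks):
    """Promote `change` through STAGES, stopping at the first failed check.

    `checks` maps a stage name to a callable that returns True when the
    change looks healthy in that stage. Stages without a check pass.
    Returns the list of stages the change reached successfully.
    """
    reached = []
    for stage in STAGES:
        check = checks.get(stage, lambda change: True)
        if not check(change):
            break  # the change is never promoted past a failing stage
        reached.append(stage)
    return reached
```

For example, `run_pipeline("change-123", {"1box": lambda c: False})` stops before `1box`, so the change never reaches production.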
1.3 Processes
When you ask for good intentions, you are not asking for a change. "Good intentions don't work, mechanisms work!"
Similar to the "Andon cord" in Toyota that stops the production line to address issues, Amazon has an Andon cord that can be pulled by the customer service department. The category owner for the pulled Andon cord needs to address the problem immediately: the product is removed from the Amazon website until the category owner addresses the problem.
Correction of errors (COE) is another process. This is a mechanism Amazon employs to learn from mistakes. COE started as emails frankly documenting errors and what was learned. Anatomy of a COE today: what happened, what was the impact, the five whys, what were the lessons learned, what are the corrective actions.
2. LADIS keynote Baidu: Big data and infrastructure
I didn't take many notes from this talk, but here is an interesting tidbit to share.
It is known that 90% of hardware failures are caused by hard disk drives. So in some sense memory is more reliable than the disk. 3-way memory replication is enough for most applications, and that is what Baidu uses. Fast recovery for a replica is more important at the end of the day.
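As a rough illustration of 3-way in-memory replication with replica recovery, here is a toy sketch. The `ReplicatedMemoryStore` class and its methods are hypothetical; the talk did not describe Baidu's implementation, and a real system would replicate across machines rather than across dictionaries in one process.

```python
class ReplicatedMemoryStore:
    """Toy model: every write goes to all live in-memory replicas."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]
        self.alive = [True] * replicas

    def put(self, key, value):
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def get(self, key):
        # Read from the first live replica.
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                return replica[key]
        raise RuntimeError("all replicas lost")

    def fail(self, i):
        """Simulate losing replica i: its memory contents are gone."""
        self.alive[i] = False
        self.replicas[i] = {}

    def recover(self, i):
        """Rebuild replica i by copying from any surviving replica."""
        for j, replica in enumerate(self.replicas):
            if self.alive[j]:
                self.replicas[i] = dict(replica)
                self.alive[i] = True
                return
        raise RuntimeError("no surviving replica to recover from")
```

The point of the sketch is that data survives as long as one replica is alive, so the window that matters is how quickly `recover` brings a lost replica back.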
3. LADIS: Lessons from an internet-scale notification system, Atul Adya, Google
Thialfi is a notification service. Thialfi was first presented in SOSP11. Since then Thialfi has scaled by several orders of magnitude. The team has learned unexpected lessons, and Atul talked about these lessons.
Thialfi overview: an app registers for X, and this is recorded at the data center; if X is updated, the app gets a notification. (This is much better than busy polling by the app.) Thialfi abstraction: an object with a unique id and a monotonically increasing 64-bit version number. Thialfi is built around soft state. It recovers registration state from clients if needed.
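The abstraction above can be sketched in a few lines. This is only a toy model under stated assumptions: the `NotificationService` class and its method names are hypothetical and are not Thialfi's API. It shows registration, version-based deduplication, and rebuilding soft registration state from a client.

```python
from collections import defaultdict


class NotificationService:
    """Toy model: latest version per object plus soft registration state."""

    def __init__(self):
        self.latest = {}                       # object_id -> highest version seen
        self.registrations = defaultdict(set)  # object_id -> client ids (soft state)
        self.delivered = []                    # (client_id, object_id, version)

    def register(self, client_id, object_id):
        self.registrations[object_id].add(client_id)

    def publish(self, object_id, version):
        """Record an update and notify registered clients.

        Versions are 64-bit and must increase; stale or duplicate
        versions are ignored and return False.
        """
        if not 0 <= version < 2**64:
            raise ValueError("version must fit in 64 bits")
        if version <= self.latest.get(object_id, -1):
            return False
        self.latest[object_id] = version
        for client_id in sorted(self.registrations[object_id]):
            self.delivered.append((client_id, object_id, version))
        return True

    def lose_state(self):
        """Simulate the data center losing its soft registration state."""
        self.registrations.clear()

    def recover_from(self, client_id, object_ids):
        """Rebuild registrations from what a client says it registered for."""
        for object_id in object_ids:
            self.register(client_id, object_id)
```

After `lose_state()`, updates reach nobody until a client re-reports its registrations through `recover_from`, which is the sense in which the registration state is soft.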
Some lessons learned from operating Thialfi.
3.1 Lesson1: Is this thing on? Working for everyone?
You can never know! You need continuous testing in production. For example, look at server graphs to infer end-to-end latency. Chrome sync was the first real client for Thialfi. For a large client like Chrome, it is even possible to monitor Twitter for complaints.
3.2 Lesson2: And you thought you could debug?
In such a large scale system, you have to log selectively. When a specific user has a problem, it may look like searching for a needle in a haystack. The team had to write custom production code for some customers.
3.3 Lesson3: Clients considered harmful
If you rely on client-side computations, enticed by the promise of lightweight/scalable servers, you will have problems with old versions of client apps. You cannot update the client code, so don't put code on clients.
3.4 Lesson4: Getting your code in the door is important
Build a feature only if customers care about it. A corollary is that you may need dirty features (weakest semantics) to get customers. For example, in one case, when they found that version numbers were not feasible for many systems, they modified Thialfi to allow time instead of version numbers.
3.5 Lesson 5: You are building your castle on sand
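This lesson is about keeping placement stable. As a minimal sketch of plain, non-geo-aware consistent hashing, the `HashRing` class below maps keys to servers using only the key hash and the server set, so placement does not shift as load fluctuates. The class name, the virtual-node count, and the server names are hypothetical; this is not Thialfi's code.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hashing with virtual nodes; not geo-aware or load-aware."""

    def __init__(self, servers, vnodes=64):
        # Each server owns `vnodes` points on the ring.
        self._ring = sorted(
            (_hash(f"{server}#{v}"), server)
            for server in servers
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key):
        """Return the server owning the first point clockwise from the key."""
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

When a server is removed, only the keys that server owned move; every other key keeps its owner, which is the stability being traded against an optimal balance.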
Use non-optimal consistent hashing (not geo-aware), rather than optimal but flapping/dithering balancing.
3.6 Lesson 6: The customer is not always right
The example given here was with respect to strict latency and SLAs.
3.7 Lesson 7: You cannot anticipate the hard parts
Hard parts of Thialfi actually turned out to be:
- Registrations: getting client and data center to agree on registration state is hard.
- Wide-area routing.
- Client library and its protocol.
- Handling overload.
3.8 Question answer section
Q: Should one design a service properly at the start or let it grow organically?
A: Atul said that he was a fan of designing properly in the first place, but this failed for Thialfi. They revisited the design 3 times. His new rule is: if you are building on top of other systems (as was the case with Thialfi), don't spend months on design.
The 3rd rewrite of Thialfi is ongoing. In this revision, they will use Google Spanner for synchronous replication of the registration state!
Q: What about the December 2012 Chrome crashes, did that have anything to do with Thialfi?
A: Nothing to do with Thialfi, Google Sync was blamed for it. Thialfi has not been implicated in a PR-level failure yet.