Megastore: Providing Scalable, Highly Available Storage For Interactive Services

Google's Megastore is the structured data store supporting the Google Application Engine. Megastore handles more than 3 billion write and 20 billion read transactions daily and stores a petabyte of primary data across many global datacenters.

Megastore tries to provide the convenience of a traditional RDBMS with the scalability of NoSQL: it is a scalable transactional indexed record manager (built on top of BigTable), providing full ACID semantics within partitions but lower consistency guarantees across partitions (aka entity groups in Figure 1). To achieve these strict consistency requirements, Megastore employs a Paxos-based algorithm for synchronous replication across geographically distributed datacenters.

I have some problems with Megastore, but I will save them for the end of the review so that I can explain Megastore first.

Paxos

Megastore uses Paxos. (Paxos is hard to cover in a blog post, as I mentioned earlier, so I will not attempt to.) The reason Paxos got much more popular than other consensus protocols (such as 2-phase and 3-phase commit) is that Paxos satisfies the safety properties of consensus even under asynchrony and arbitrary message loss. This does not conflict with the coordinated attack and FLP impossibility results. Those impossibility results say that you can't achieve consensus (both safety and liveness at the same time); they do not say you have to sacrifice safety under message loss or asynchrony. So Paxos preserves safety under all conditions and achieves liveness when conditions improve beyond the impossibility realm (fewer message losses, some timing assumptions start to hold).

Basic Paxos is a 2-phase protocol. In the prepare phase the leader replica tries to get the other nonleader replicas to recognize it as the leader for that consensus instance. In the accept phase the leader tries to get the nonleader replicas to accept the vote it proposes. So basic Paxos requires at least two round trips, and that is very inefficient for WAN usage. Fortunately, there have been several Paxos variants that optimize the performance. One optimization is MultiPaxos, which permits single-roundtrip writes by basically piggybacking the prepare phase of the upcoming consensus instance onto the accept phase of the current consensus instance.
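To make the two round trips concrete, here is a minimal in-memory sketch of a single consensus instance. It is only an illustration under simplifying assumptions: the names `Acceptor` and `propose` are mine, calls are synchronous, and there is no message loss or concurrency, so it shows the shape of the protocol rather than a usable implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                     # highest proposal number promised so far
    accepted_n: int = -1                   # proposal number of the accepted value, if any
    accepted_value: Optional[str] = None

    def prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n: int, value: str) -> bool:
        # Phase 2: accept unless a higher-numbered promise has been made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    majority = len(acceptors) // 2 + 1

    # Round trip 1: prepare.
    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None

    # If some acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of our own value.
    _, prior_value = max(promises, key=lambda p: p[0])
    if prior_value is not None:
        value = prior_value

    # Round trip 2: accept.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

Note that a second proposer with a higher number ends up re-proposing the already chosen value, which is the safety property at work.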

Another optimization targets the cost of reads. In basic Paxos, a read operation also needs to go through the 2-phase protocol involving all the replicas (or at least a majority of them) to be serialized and served. The read optimization enables serving reads locally, but only at the leader replica. When a nonleader replica gets a read request, it has to forward it to the leader to be served locally there. The read optimization is made possible by having the leader impose a lease on being the leader at the other replicas (during which the replicas cannot accept another leader's prepare phase). Thanks to the lease, the leader is guaranteed to be the leader until the lease expires, is guaranteed to have the most up-to-date view of the system, and can serve the read locally. The nice thing about the MultiPaxos and local-read-at-the-leader optimizations is that they do not change any guarantees of Paxos; safety is preserved under all conditions, and progress is satisfied when the conditions are sufficient for making progress.
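The lease idea can be sketched as follows. This is a toy model, not how any particular system implements it: `Replica`, `acquire_lease`, and `read` are hypothetical names, time is passed in explicitly as `now`, and clock drift (which a real implementation must budget for) is ignored.

```python
from typing import List, Optional


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.lease_holder: Optional[str] = None  # whom this replica promised to follow
        self.lease_expiry = 0.0                  # until when that promise holds
        self.leader_until = 0.0                  # how long this replica may act as leader
        self.value: Optional[str] = None

    def grant_lease(self, candidate: str, now: float, duration: float) -> bool:
        # Refuse while a different leader's lease is still running.
        other_lease_active = (
            self.lease_holder not in (None, candidate) and now < self.lease_expiry
        )
        if other_lease_active:
            return False
        self.lease_holder = candidate
        self.lease_expiry = now + duration
        return True


def acquire_lease(candidate: Replica, replicas: List[Replica],
                  now: float, duration: float) -> bool:
    # The candidate becomes leader only if a majority grants the lease.
    grants = sum(r.grant_lease(candidate.name, now, duration) for r in replicas)
    if grants > len(replicas) // 2:
        candidate.leader_until = now + duration
        return True
    return False


def read(replica: Replica, leader: Replica, now: float) -> Optional[str]:
    # Nonleaders forward the read; the leader serves it from local state,
    # but only while its lease is still valid.
    if replica is not leader:
        return read(leader, leader, now)
    if now >= leader.leader_until:
        raise RuntimeError("lease expired: fall back to a full Paxos round")
    return leader.value
```

While the lease is live, no other replica can win a majority of grants, which is why the leader's local state is safe to read.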

Megastore's use of Paxos

Megastore uses Paxos (with the MultiPaxos extension) in a pretty standard way to replicate a write-ahead log over a group of symmetric peers. Megastore runs an independent instance of the Paxos algorithm for each log position. The leader for each log position is a distinguished replica chosen alongside the preceding log position's consensus value. (This is the MultiPaxos optimization I discussed above.) The leader arbitrates which value may use proposal number zero. The first writer to submit a value to the leader wins the right to ask all replicas to accept that value as proposal number zero. All other writers must fall back on two-phase Paxos. Since a writer must communicate with the leader before submitting the value to other replicas, the system minimizes writer-leader latency. The policy for selecting the next write's leader is designed around the observation that most applications submit writes from the same region repeatedly. This leads to a simple but effective heuristic: use the closest replica.
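The arbitration and the leader-selection heuristic described above can be sketched like this. Everything here is illustrative: `LogPositionLeader` and `choose_next_leader` are hypothetical names, and "closest" is reduced to "in the same region as the writer", which is my simplification rather than anything the paper specifies.

```python
from typing import Dict


class LogPositionLeader:
    """Hypothetical arbiter for proposal number zero at each log position."""

    def __init__(self) -> None:
        self._winner: Dict[int, str] = {}  # log position -> winning writer id

    def request_fast_path(self, position: int, writer: str) -> bool:
        # The first writer to ask wins; setdefault keeps any earlier winner.
        # Losers must fall back on two-phase Paxos with a higher proposal number.
        return self._winner.setdefault(position, writer) == writer


def choose_next_leader(writer_region: str,
                       replica_regions: Dict[str, str],
                       current_leader: str) -> str:
    # Heuristic from the text: use the replica closest to the writer.
    # Assumption: "closest" means "in the writer's region".
    for replica, region in replica_regions.items():
        if region == writer_region:
            return replica
    return current_leader
```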

However, in addition to the straightforward MultiPaxos optimization above, Megastore also introduces a surprising new extension to allow local reads at any up-to-date replica. This came as a big surprise to me because the best anyone could do before was to allow local reads at the leader. What was it that we were missing? I didn't get how this was possible the first time I read the paper; I only got it on my second look at the paper.

Coordinator, the rabbit pulled out of the hat

Megastore uses a service called the Coordinator, with servers in each replica's datacenter. A coordinator server tracks a set of entity groups (i.e., the partitions mentioned in the first paragraph) for which its replica has observed all Paxos writes. For entity groups in that set, the replica is deemed to have sufficient state to serve local reads. If the coordinator claims that it is up to date, then the corresponding replica can serve a read for that entity group locally; otherwise the other replicas (and a couple of network roundtrips) need to be involved.
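A rough sketch of the read path this implies is below. The `Coordinator` class, the `read` function, and the `catch_up` callback are all hypothetical stand-ins of mine; in particular, `catch_up` abstracts away the whole "ask other replicas" step.

```python
from typing import Callable, Dict, Set


class Coordinator:
    """Tracks the entity groups for which the local replica has seen all writes."""

    def __init__(self) -> None:
        self.up_to_date: Set[str] = set()

    def is_current(self, group: str) -> bool:
        return group in self.up_to_date

    def validate(self, group: str) -> None:
        self.up_to_date.add(group)

    def invalidate(self, group: str) -> None:
        self.up_to_date.discard(group)


def read(group: str,
         coordinator: Coordinator,
         local_store: Dict[str, str],
         catch_up: Callable[[str], str]) -> str:
    # Fast path: the coordinator vouches for the local replica.
    if coordinator.is_current(group):
        return local_store[group]
    # Slow path: involve other replicas (network roundtrips), then
    # mark the group as current again so later reads are local.
    value = catch_up(group)
    local_store[group] = value
    coordinator.validate(group)
    return value
```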

But how does the coordinator know whether it is up to date or not? The paper states that it is the responsibility of the write algorithm to keep coordinator state conservative. If a write fails on a replica's Bigtable, it cannot be considered committed until the group's key has been evicted from that replica's coordinator. What does this mean? It means that write operations are penalized to improve the performance of read operations. In Megastore's Paxos, before a write is considered committed and ready to apply, all full replicas must have accepted it or had their coordinator invalidated for that entity group. In contrast, in Paxos a write can be committed with only a majority of replicas accepting the write.
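The difference between the two commit rules fits in a few lines. This is a sketch under my own modeling assumptions: each replica's outcome for a write is one of three hypothetical states, and I assume Megastore still needs the usual Paxos majority of accepts on top of the "everyone accounted for" condition.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    ACCEPTED = "accepted"        # replica accepted the write
    INVALIDATED = "invalidated"  # replica missed it, but its coordinator was invalidated
    NO_RESPONSE = "no_response"  # neither happened yet (failure or partition)


def paxos_committed(outcomes: List[Outcome]) -> bool:
    # Plain Paxos: a majority of accepts is enough.
    accepted = sum(o is Outcome.ACCEPTED for o in outcomes)
    return accepted > len(outcomes) // 2


def megastore_committed(outcomes: List[Outcome]) -> bool:
    # Megastore's rule as described in the text: in addition, every full
    # replica must have accepted or had its coordinator invalidated.
    everyone_accounted_for = all(o is not Outcome.NO_RESPONSE for o in outcomes)
    return paxos_committed(outcomes) and everyone_accounted_for
```

With three replicas, one unreachable replica does not block plain Paxos, but it blocks the Megastore write until its coordinator is invalidated.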

Performance problems

Using synchronous replication over WAN of course takes its toll on the performance. This has been noticed and discussed here.

Of course, there is also the performance degradation due to waiting for an acknowledgement (or timeout) from all replicas for a write operation. This also leads to a write availability problem. The paper tries to argue, as follows, that this is not a big problem in practice, but it is evident that partitions and failures result in write unavailability until they are recovered from.

"In the write algorithm above, each full replica must either accept or have its coordinator invalidated, so it might appear that any single replica failure (Bigtable and coordinator) will cause unavailability. In practice this is not a common problem. The coordinator is a simple process with no external dependencies and no persistent storage, so it tends to be much more stable than a Bigtable server. Nevertheless, network and host failures can still make the coordinator unavailable.

This algorithm risks a brief (tens of seconds) write outage when a datacenter containing live coordinators suddenly becomes unavailable--all writers must wait for the coordinator's Chubby locks to expire before writes can complete (much like waiting for a master failover to trigger). Unlike after a master failover, reads and writes can proceed smoothly while the coordinator's state is reconstructed. This brief and rare outage risk is more than justified by the steady state of fast local reads it allows."

In the abstract, the paper claimed that Megastore achieves both consistency and availability, and this was a red flag for me, as we all know that something has to give due to the CAP theorem. And above we have seen that write availability suffers in the presence of a partition.

Exercise question

Megastore has a limit of "a few writes per second per entity group" because higher write rates will cause even worse performance due to the conflicts and retries of the multiple leaders (aka dueling leaders). Is it possible to adopt the technique of partitioning consensus sequence numbers in "Mencius: building efficient replicated state machines for Wide-Area-Networks (WANs)" to alleviate this problem?
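As a starting point for thinking about the question, the core of the partitioning idea is that each log position has a pre-assigned owner, so replicas do not duel over the same position. The sketch below shows only that round-robin assignment; `owner` and `next_owned_position` are hypothetical names, and the rest of Mencius (for example, how an idle replica gives up its turns) is deliberately left out.

```python
from typing import List


def owner(position: int, replicas: List[str]) -> str:
    # Round-robin partitioning of the sequence-number space.
    return replicas[position % len(replicas)]


def next_owned_position(replica: str, after: int, replicas: List[str]) -> int:
    # Smallest log position greater than `after` that `replica` owns.
    if replica not in replicas:
        raise ValueError(f"unknown replica: {replica}")
    position = after + 1
    while owner(position, replicas) != replica:
        position += 1
    return position
```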

Additional links

https://christmasloveday.blogspot.com//search?q=paxos-taught