Paper Summary. Corfu: A Shared Log Design for Flash Clusters
By: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, John D. Davis. Appeared in NSDI'2012.
Dynamo/Cassandra/Voldemort replication, but linearizability is still guaranteed.
Previously I had summarized the Tango paper, for maintaining distributed data structures over a shared log. Tango builds on the Corfu log abstraction.
Corfu involves 3 main functions:
- A mapping function (maintained at the VPaxos box) from logical positions in the log to flash pages on the cluster of flash units
- A tail-finding mechanism (using a sequencer node) for finding the next available logical position on the log for new data
- A replication protocol (chain replication!) to write a log entry consistently on multiple flash pages
Mapping in Corfu
Each Corfu client maintains a local, read-only replica of a data structure called a projection that carves the logical log into disjoint ranges. Each such range is mapped to a list of extents within the address spaces of individual flash units.
In the basic, unreplicated case, each log position maps to a single flash page; for replication, each extent is associated with a replica set of flash units rather than just one unit. For example, for two-way replication the extent F0:0:20K would be replaced by F0/F0':0:20K and the extent F1:0:20K would be replaced by F1/F1':0:20K.
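To make the projection idea concrete, here is a minimal sketch of a lookup from a logical log position to a replica set and a flash page. The class names (`Extent`, `Range`, `Projection`) and the round-robin striping of positions across the extents of a range are assumptions made for illustration; the summary above only says that each range maps to a list of extents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    replicas: tuple   # flash units in the replica set, in chain order
    first_page: int   # first page of the extent on each unit
    num_pages: int    # length of the extent in pages


@dataclass(frozen=True)
class Range:
    start: int        # first logical position covered (inclusive)
    end: int          # end of the covered positions (exclusive)
    extents: tuple    # extents this range is spread over


@dataclass(frozen=True)
class Projection:
    epoch: int
    ranges: tuple

    def lookup(self, pos):
        """Return (replica set, page) for a logical log position."""
        for r in self.ranges:
            if r.start <= pos < r.end:
                offset = pos - r.start
                # Assumption: positions are striped round-robin over extents.
                ext = r.extents[offset % len(r.extents)]
                page = ext.first_page + offset // len(r.extents)
                return ext.replicas, page
        raise KeyError(f"position {pos} is outside the current projection")


# Two-way replicated version of the example: F0/F0':0:20K and F1/F1':0:20K.
proj = Projection(
    epoch=0,
    ranges=(
        Range(0, 40000, (
            Extent(("F0", "F0'"), 0, 20000),
            Extent(("F1", "F1'"), 0, 20000),
        )),
    ),
)
```

A position past the mapped range raises `KeyError` here, which corresponds to the case where the tail of the log has moved past the active range and a new projection is needed.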
When some event occurs that necessitates a change in the mapping --for example, when a flash unit fails, or when the tail of the log moves past the current active range-- a new projection (a new view with a new epoch number) has to be installed on all clients in the system.
To maintain and reconfigure this mapping, Corfu uses VPaxos. The VPaxos box keeps the mapping from the logical log to physical SSD extents/ranges, and updates it on failures and when an extent fills up.
This VPaxos-based auxiliary-driven reconfiguration involves two distinct steps:
1. Sealing the current projection: When a client Cr decides to reconfigure the system from the current projection Pi to a new projection Pi+1, it first seals Pi; this involves sending a seal command to a subset of the flash units in Pi. Sealing ensures that flash units will reject in-flight messages --writes as well as reads-- sent to them in the context of the sealed projection.
2. Writing the new projection at the VPaxos box: Once the reconfiguring client Cr has successfully sealed the current projection Pi, it attempts to write the new projection Pi+1 at the (i + 1)th position in the VPaxos box. If another client has already written to that position, client Cr aborts its own reconfiguration, reads the existing projection at position (i + 1), and uses it as its new current projection.
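The two steps above can be sketched roughly as follows. This is a single-process illustration, not Corfu's implementation: `FlashUnit`, `ProjectionStore` (standing in for the VPaxos box as a write-once sequence of projections) and `reconfigure` are hypothetical names, and the rule that a unit rejects any message tagged with an epoch at or below the sealed one is an assumed way of modeling sealing.

```python
class FlashUnit:
    """Tracks the highest sealed epoch and rejects messages from sealed epochs."""

    def __init__(self):
        self.sealed_epoch = -1  # nothing sealed yet

    def seal(self, epoch):
        self.sealed_epoch = max(self.sealed_epoch, epoch)

    def accepts(self, msg_epoch):
        # Assumption: every read/write carries the sender's projection epoch.
        return msg_epoch > self.sealed_epoch


class ProjectionStore:
    """Stand-in for the VPaxos box: each position can be written only once."""

    def __init__(self, initial):
        self.slots = [initial]

    def write_once(self, index, projection):
        if index < len(self.slots):
            return False  # another client already wrote this position
        if index > len(self.slots):
            raise IndexError("cannot skip projection positions")
        self.slots.append(projection)
        return True

    def read(self, index):
        return self.slots[index]


def reconfigure(units_to_seal, store, current_epoch, proposed):
    """Seal projection P_i, then try to install P_{i+1}; return the winner."""
    # Step 1: seal the current projection on the chosen flash units.
    for unit in units_to_seal:
        unit.seal(current_epoch)
    # Step 2: write the new projection at position i + 1, or adopt the
    # projection that another client installed there first.
    if store.write_once(current_epoch + 1, proposed):
        return proposed
    return store.read(current_epoch + 1)
```

If two clients race to reconfigure from the same epoch, both seal, but only one write at position i + 1 succeeds and the other client adopts the winner's projection.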
Finding tail in Corfu
To eliminate contention at the tail of the log, Corfu uses a dedicated sequencer that assigns clients 'tokens', corresponding to empty log positions. To append data, a client first goes to the sequencer, which returns its current value and increments itself. The sequencer is only an optimization to reduce contention in the system and is not required for either safety or progress.
Replication in Corfu
Corfu uses a simple chaining protocol (a client-driven variant of Chain Replication) to achieve safety-under-contention and durability. When a client wants to write to a replica set of flash pages, it updates them in a deterministic replica order, waiting for each flash unit to respond before moving to the next one. If two clients try to concurrently update the same replica set of flash pages, one of them will arrive second at the first unit of the chain and receive an error overwrite.
To read from the replica set, clients go to the last unit of the chain. If the last unit has not yet been updated, it will return an error unwritten.
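The token handout and the client-driven chain write and read described above can be sketched as below. All names (`Sequencer`, `FlashUnit`, `ErrOverwrite`, `ErrUnwritten`, `append`) are hypothetical, the flash unit is modeled as an in-memory write-once page store, and for brevity the log position is used directly as the page number on a single replica set rather than going through a projection.

```python
class ErrOverwrite(Exception):
    """The page was already written (write-once violation)."""


class ErrUnwritten(Exception):
    """The page has not been written yet."""


class FlashUnit:
    """In-memory stand-in for a flash unit with write-once pages."""

    def __init__(self):
        self.pages = {}

    def write(self, page, value):
        if page in self.pages:
            raise ErrOverwrite(page)
        self.pages[page] = value

    def read(self, page):
        if page not in self.pages:
            raise ErrUnwritten(page)
        return self.pages[page]


class Sequencer:
    """Hands out tokens: returns the current tail and increments it."""

    def __init__(self):
        self.tail = 0

    def next_token(self):
        token = self.tail
        self.tail += 1
        return token


def chain_write(chain, page, value):
    # Update replicas in a fixed order, one unit at a time. A client that
    # arrives second at the head gets ErrOverwrite and touches nothing else.
    for unit in chain:
        unit.write(page, value)


def chain_read(chain, page):
    # Reads go to the last unit, so only fully replicated values are seen.
    return chain[-1].read(page)


def append(sequencer, chain, value):
    pos = sequencer.next_token()
    chain_write(chain, pos, value)
    return pos
```

Because the write-once check at the head of the chain decides which of two competing writes wins, the sequencer only reduces how often clients collide; correctness does not depend on it.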
To fill holes (which is important for RSM maintenance from the log), the client starts by checking the first unit of the chain to determine if a valid value exists in the prefix of the chain. If such a value exists, the client walks down the chain to find the first unwritten replica, and then completes the append by copying over the value to the remaining unwritten replicas in chain order. Alternatively, if the first unit of the chain is unwritten, the client writes the junk value to all the replicas in chain order.
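The hole-filling rule above can be sketched like this. It is a single-threaded illustration with hypothetical names (`FlashUnit`, `JUNK`, `fill`); a real client would also have to handle an overwrite error if another client writes the same page between the check and the write.

```python
JUNK = object()  # sentinel standing in for the junk value


class FlashUnit:
    """In-memory stand-in for a flash unit with write-once pages."""

    def __init__(self):
        self.pages = {}

    def is_written(self, page):
        return page in self.pages

    def write(self, page, value):
        if page in self.pages:
            raise ValueError("overwrite")
        self.pages[page] = value

    def read(self, page):
        return self.pages[page]


def fill(chain, page):
    """Complete a half-finished append, or junk-fill an empty position."""
    head = chain[0]
    # If the head holds a value, an append was started: finish it.
    # Otherwise nobody wrote this position: fill it with junk.
    value = head.read(page) if head.is_written(page) else JUNK
    for unit in chain:  # walk the chain in order
        if not unit.is_written(page):
            unit.write(page, value)
    return value
```

Either way, once `fill` returns, every replica in the chain holds the same value for that page, so a reader at the tail no longer sees the position as unwritten.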
MAD questions
1. Do SSDs still work this way?
I am not current on my SSD knowledge. The paper makes use of properties of flash SSDs: it assumes specific error codes to be returned for "no item", "item", and "junk", and in effect, it treats the SSDs as write-once registers for the purpose of the log. In return, it also tries to account for some of their limitations, like the uneven wear problem, and tries to load-balance the wear.
Did anything change in the way SSDs work that changes these assumptions/requirements?
2. How can we improve on some drawbacks?
A big drawback in Corfu is that any time a fault occurs, everything stalls and a reconfiguration is performed before reads/writes can continue on the active extent. This is also a problem with chain replication based protocols in general.
Would there be some simple solutions to improve Corfu to address this?
For example, would it be possible to come up with a more clever, single node crash tolerant mapping? Ceph had a clever hierarchical hashing called CRUSH, maybe something along those lines.
As I mentioned in the MAD questions of the previous blog post, Cosmos DB has operationalized a fault-masking streamlined version of replication via nested replica-sets deployed in fan-out topology. Rather than doing offline updates from a log, Cosmos DB updates the database at the replicas online, in place, to provide strong consistency and bounded-staleness consistency reads among other read levels. On the other hand, Cosmos DB also maintains a change log by way of a witness replica, which serves several useful purposes, including fault-tolerance, remote storage, and snapshots for analytic workloads.