Retrospective Lightweight Distributed Snapshots Using Loosely Synchronized Clocks
This is a summary of our recent work that appeared at ICDCS 2017. The tool we developed, Retroscope (available on GitHub), enables unplanned retrospective consistent global snapshots for distributed systems.
Many distributed systems would benefit from the ability to take unplanned snapshots of the past. Let's say Alice notices alarms going off for her distributed system deployment at 4:05pm. If she could roll back to the state of the distributed system at 4:00pm, and roll forward step by step to figure out what caused the problems, she may be able to remedy the problem.
The ability to take retrospective snapshots requires each node to maintain a log of state changes and then to collate/align these logs to construct a consistent cut at a given time. However, clock uncertainty/skew among nodes is dangerous and can lead to taking an inconsistent snapshot. For example, the cut at 4:00pm in the figure using NTP is inconsistent, because event F is included in the cut, but the causally preceding event E is not included.
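To make the inconsistency concrete, here is a minimal sketch of a consistency check for a cut. The event names E and F follow the example above; the clock offsets, the `caused_by` map, and the function name are illustrative assumptions, not part of Retroscope.

```python
from typing import Dict, Set


def is_consistent_cut(cut: Set[str], caused_by: Dict[str, Set[str]]) -> bool:
    """A cut is consistent if it is closed under happened-before:
    every direct causal predecessor of an included event is also included."""
    return all(caused_by.get(event, set()) <= cut for event in cut)


# Hypothetical run: P1 sends a message at event E, P2 receives it at event F.
caused_by = {"F": {"E"}}

# Assumed local physical timestamps in milliseconds relative to 4:00pm.
# P1's clock runs ahead of P2's, so E is stamped later than F even though
# E happened before F.
physical_ts = {"E": 2, "F": -1}

# Cutting by physical time at 4:00pm (offset 0) keeps F but drops E.
cut = {event for event, ts in physical_ts.items() if ts <= 0}
print(cut, is_consistent_cut(cut, caused_by))  # {'F'} False
```

Checking only direct predecessors is enough here: if every included event has its direct predecessors in the cut, the transitive predecessors follow by induction.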
To avoid this problem with physical clock synchronization, our snapshotting service Retroscope employs Hybrid Logical Clocks (HLC), which combine the benefits of physical time with the causality tracking of logical clocks (LC). By leveraging HLC, our system can take non-blocking unplanned/retrospective consistent snapshots of distributed systems with affinity to physical time.
Hybrid Logical Clocks
Each HLC timestamp is a two-part tuple: the first part is a shadow copy of NTP time, and the second part is an overflow buffer used when multiple events have identical first parts in their HLC timestamps. The first part of HLC at a node is maintained as the highest NTP time that the node is aware of. (This comes from the node's own physical clock, or from a remote node from which a message was received recently.) The second part acts as the logical clock for events with identical first parts, and it is reset every time the first part is updated.
The figure shows HLC in operation, with HLC timestamps in green and the machine's physical time in red. Note how the message exchange between P1 and P2 bumps P2's first HLC component to be higher than its own physical clock.
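The update rules can be sketched in a few lines. This is a minimal illustration of the standard HLC algorithm, not Retroscope's Java implementation; the class name, method names, and the injectable `physical_clock` parameter are assumptions made for the example.

```python
import time
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (l, c): highest physical time seen, overflow counter


class HybridLogicalClock:
    def __init__(self, physical_clock: Callable[[], int] = lambda: time.time_ns() // 1_000_000):
        self._now = physical_clock  # assumed to return milliseconds
        self.l = 0
        self.c = 0

    def tick(self) -> Timestamp:
        """Timestamp a local or send event."""
        pt = self._now()
        if pt > self.l:
            self.l, self.c = pt, 0   # first part advanced: reset the counter
        else:
            self.c += 1              # same first part: bump the counter
        return (self.l, self.c)

    def receive(self, remote: Timestamp) -> Timestamp:
        """Timestamp a receive event, merging the sender's timestamp."""
        pt = self._now()
        rl, rc = remote
        new_l = max(self.l, rl, pt)
        if new_l == self.l and new_l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1          # adopt the sender's first part, continue its counter
        else:
            self.c = 0               # local physical clock is ahead of both
        self.l = new_l
        return (self.l, self.c)


# P1's physical clock reads 10, P2's lags at 5 (fixed values for illustration).
p1 = HybridLogicalClock(lambda: 10)
p2 = HybridLogicalClock(lambda: 5)
sent = p1.tick()             # (10, 0)
received = p2.receive(sent)  # (10, 1): P2's first part is now above its own clock
print(sent, received, sent < received)
```

Because timestamps are plain tuples, Python's tuple ordering gives the comparison used throughout: first part first, counter as the tie-breaker.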
HLC not only keeps track of physical time, but also provides the same guarantees as the LC. Namely, HLC satisfies the LC condition: if e happened before f (e hb f), then HLC.e < HLC.f. These HLC properties allow us to identify consistent cuts the same way we identify them with LC: a cut is consistent when all events have the same timestamp. In situations where we have no event with the desired timestamp, we can introduce a phantom, non-mutating event at the desired timestamp and use it to identify the consistent cut.
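One way to read the phantom event is as a cut point that needs no matching log entry: on each node, the cut at HLC time T simply keeps every event with timestamp at most T. If f is in the cut and e hb f, then HLC.e < HLC.f <= T, so e is in the cut as well. The sketch below assumes each node's log is a list of (timestamp, event name) pairs sorted by timestamp; the function name and the log layout are illustrative, not Retroscope's API.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Timestamp = Tuple[int, int]  # HLC timestamp (l, c)
Log = List[Tuple[Timestamp, str]]


def cut_at(logs: Dict[str, Log], t: Timestamp) -> Dict[str, List[str]]:
    """Return, per node, the events with HLC timestamp <= t.

    t acts as a phantom event: no logged event needs to carry exactly t.
    """
    cut = {}
    for node, log in logs.items():
        stamps = [ts for ts, _ in log]
        k = bisect_right(stamps, t)  # index of first event strictly after t
        cut[node] = [name for _, name in log[:k]]
    return cut


# E (send on P1) happened before F (receive on P2), so HLC.E < HLC.F.
logs = {"P1": [((10, 0), "E")], "P2": [((10, 1), "F")]}
print(cut_at(logs, (10, 0)))  # {'P1': ['E'], 'P2': []}
print(cut_at(logs, (11, 0)))  # {'P1': ['E'], 'P2': ['F']}
```

No choice of T can include F without E here, which is exactly what the physical-clock cut failed to guarantee.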
Taking a Snapshot
An initiating agent can start the snapshot process by sending a snapshot request to every node. Each node, upon receiving the request, performs a local snapshot of its current state and uses the window-log of state changes to undo changes until the state reaches the requested HLC time. Each node performs its snapshot independently, and there is no need for inter-node coordination on taking the snapshot. Once all machines have arrived at local snapshots at the same HLC time, we have obtained a global distributed snapshot (by virtue of the HLC property stated above).
Our tool Retroscope also supports incremental snapshots, which take an inexpensive snapshot in the vicinity of another snapshot by undoing (or redoing) events to reach the new one. These incremental snapshots are very useful for monitoring tasks, in which we need to explore many past system states in a step-through fashion while searching for some invariant violation or error cause.
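The per-node undo/redo step might look roughly like the following. This is a sketch under simplifying assumptions: state is a flat key-value dictionary, each log entry records both the old and the new value, and `None` marks an absent key (so `None` cannot be stored as a real value). The `Change` record and `move_snapshot` function are hypothetical names, not Retroscope's API.

```python
from typing import Any, Dict, List, NamedTuple, Optional, Tuple

Timestamp = Tuple[int, int]  # HLC timestamp (l, c)


class Change(NamedTuple):
    ts: Timestamp
    key: str
    old: Optional[Any]  # value before the change, None if the key was absent
    new: Optional[Any]  # value after the change, None if the key was deleted


def _set(state: Dict[str, Any], key: str, value: Optional[Any]) -> None:
    if value is None:
        state.pop(key, None)
    else:
        state[key] = value


def move_snapshot(state: Dict[str, Any], log: List[Change],
                  frm: Timestamp, to: Timestamp) -> Dict[str, Any]:
    """Move a snapshot copy from HLC time frm to HLC time to, in place.

    log must be sorted by timestamp. Going back undoes changes newest first;
    going forward redoes them oldest first.
    """
    if to < frm:
        for ch in reversed(log):
            if to < ch.ts <= frm:
                _set(state, ch.key, ch.old)
    else:
        for ch in log:
            if frm < ch.ts <= to:
                _set(state, ch.key, ch.new)
    return state


log = [
    Change((1, 0), "x", None, 1),
    Change((2, 0), "x", 1, 2),
    Change((3, 0), "y", None, 7),
]
live = {"x": 2, "y": 7}                              # node state at HLC time (3, 0)
snap = move_snapshot(dict(live), log, (3, 0), (1, 0))
print(snap)                                          # {'x': 1}
print(move_snapshot(snap, log, (1, 0), (2, 0)))      # {'x': 2}
```

The second call is the incremental case: it replays one change from the existing snapshot copy instead of starting again from the live state.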
We have implemented Retroscope as a set of libraries to be added to existing Java projects to enable retrospective snapshot capabilities. Our Retroscope library provides tools to log and keep track of state changes and the HLC time of such changes independently at each node. The current implementation uses an in-memory sliding-window log for state history, which has a configurable capacity. The Retroscope server library provides the tools to aggregate multiple such event logs and identify consistent cuts across those logs with affinity to physical time. The API also allows querying for consistent cuts that satisfy certain predicates using a SQL-like query language.
We will talk about Retroscope's querying and monitoring features in a later blog post.
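A bounded window-log limits how far back a node can roll. The sketch below shows one way such a log could track that limit; the `WindowLog` class and its methods are hypothetical and only illustrate the idea of a configurable capacity, not the actual Retroscope library.

```python
from collections import deque
from typing import Any, Optional, Tuple

Timestamp = Tuple[int, int]  # HLC timestamp (l, c)


class WindowLog:
    """In-memory sliding window over the most recent state changes."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._entries = deque(maxlen=capacity)
        self._evicted_up_to: Optional[Timestamp] = None

    def append(self, ts: Timestamp, key: str, old: Any, new: Any) -> None:
        # Entries are assumed to arrive in increasing HLC order on one node.
        if len(self._entries) == self._entries.maxlen:
            self._evicted_up_to = self._entries[0][0]  # about to fall out of the window
        self._entries.append((ts, key, old, new))

    def covers(self, ts: Timestamp) -> bool:
        """True if every change after ts is still in the window, so the
        state at ts can be rebuilt by undoing from the current state."""
        return self._evicted_up_to is None or ts >= self._evicted_up_to


window = WindowLog(capacity=2)
window.append((1, 0), "x", None, 1)
window.append((2, 0), "x", 1, 2)
window.append((3, 0), "y", None, 7)   # evicts the (1, 0) entry
print(window.covers((1, 0)))          # True: only (2, 0) and (3, 0) must be undone
print(window.covers((0, 5)))          # False: undoing (1, 0) is no longer possible
```

A larger capacity buys a longer retrospective horizon at the cost of memory, which is the trade-off a configurable window exposes.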
Evaluation
We have added Retroscope capabilities to several Java applications, such as the Voldemort key-value database, the Hazelcast in-memory data grid, and ZooKeeper.
Retroscoping Voldemort took less than a thousand lines of code for adding HLC to the network protocol, recording changes in the Retroscope window-log, and performing snapshots on Voldemort's storage. To quantify the overhead caused by Retroscope, we compared the performance of our modified Voldemort with the unmodified version on a 10-node cluster. The figure below illustrates the throughput degradation that resulted from adding Retroscope to the system. We observed that in the worst case, the degradation was around 10%; however, in some cases we observed no degradation at all. Latency overheads were similar to the throughput overheads.
Taking a snapshot brings more stress to Voldemort, a disk-based system, as each node now needs to get a copy of its local state and undo changes from that copy. Retroscope allows the entire snapshot routine to run in a non-blocking manner, but with some performance degradation. The figure below illustrates the throughput and latency of client requests on the 10-node cluster while performing a snapshot on a database of 1,000,000 items. Average throughput degradation of the system while taking a snapshot was 18%, although the performance can be improved by using a separate disk to make the database copy.
We also ran a similar experiment on the Hazelcast in-memory data grid, and observed very little performance degradation associated with the snapshot, since Hazelcast is an in-memory system.
(This was a joint post with Aleksey Charapko.)