Retroscope: Retrospective Cut-Monitoring Of Distributed Systems (Part 3)

This post continues the discussion on monitoring distributed systems with Retroscope. Here we focus on the cut monitoring approach Retroscope uses. (This post is jointly written with Aleksey Charapko and Ailidani Ailijiang.)

Retroscope is a monitoring system for exploring the global/nonlocal state history of a distributed system. It differs from other monitoring tools in the way it inspects the system state. While request tracers inspect the system by following the trace of a request (i.e., request r in the figure), Retroscope performs cut monitoring and examines the system at consistent global cuts, observing the state across many machines and requests. It moves along the system history and scans a progression of states one cut at a time, checking cut Ts1, then Ts2, and so on.
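To make the idea concrete, here is a minimal sketch (in Java, with hypothetical types; not Retroscope's actual API) of what scanning a progression of consistent cuts looks like: each cut is a snapshot of per-node state at one timestamp, and monitoring evaluates a global predicate on each cut in turn.

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Minimal sketch of the cut monitoring idea (illustrative only):
// a consistent cut is a snapshot of per-node state at one logical timestamp,
// and monitoring slides over successive cuts testing a global predicate.
public class CutScanSketch {

    // Hypothetical representation: node id -> that node's state at the cut's timestamp.
    record Cut(long timestamp, Map<String, Long> nodeState) {}

    // Scan cuts Ts1, Ts2, ... in order and report the ones where the predicate holds.
    static void scan(List<Cut> history, Predicate<Cut> violation) {
        for (Cut cut : history) {
            if (violation.test(cut)) {
                System.out.println("predicate holds at cut " + cut.timestamp());
            }
        }
    }
}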

Retroscope's cut monitoring approach is complementary to the request tracing solutions, and brings a number of advantages. First, by exposing the nonlocal state, Retroscope enables users to examine nonlocal properties of distributed applications. Using Retroscope you can inspect state distributed across many machines and can reason about the execution of a complex distributed application through invariant checking. Furthermore, by sifting through many past nonlocal states, you can perform root-cause analysis and use the across-node context to diagnose race conditions, nonlocal state inconsistencies, and nonlocal invariant violations.

To illustrate some of these benefits, we use Retroscope and the Retroscope Query Language (RQL) to study the data staleness of replica nodes in a ZooKeeper cluster. Staleness is a nonlocal property that cannot be easily observed by other monitoring techniques. To our surprise, we found that even a normally operating cluster can exhibit large staleness. In one of our observations in AWS EC2, some ZooKeeper replicas were lagging by as much as 22 versions behind the rest of the cluster, as we discuss at the end of this post.

Feasibility of Cut Monitoring

Ok, if cut monitoring is so useful, why was this not done before? The answer is that cut monitoring was not really feasible. A standard way to do cut monitoring is with vector clocks (VC), but VC do not scale well for large systems due to their O(N) space complexity. Moreover, using VC results in identifying an excessive number of concurrent cuts for a given point, many of which are false positives that do not take place in the actual system execution.

Retroscope employs hybrid logical clocks (HLC) and a scalable stream processing architecture to provide a viable end-to-end solution for cut monitoring. The NTP-synchronized physical clock component of HLC shrinks the number of consistent cuts at a given point to only 1. (It may be argued that this reduces the theoretical coverage compared to VC, but this is a good tradeoff to take in order to improve performance and avoid the false positives resulting from VC.) Using HLC also allows us to build consistent cuts without the need to coordinate across nodes. Finally, the HLC size is constant, and this reduces the communication overheads. We talked about these advantages in Part 1.
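To make the constant-size point concrete, here is a minimal sketch of the HLC update rules following the published HLC algorithm (not Retroscope's actual implementation; pt() stands in for the node's NTP-synchronized physical clock). A timestamp is a pair (l, c), regardless of the number of nodes in the system.

// Minimal sketch of hybrid logical clock update rules (illustrative only).
// l tracks the maximum physical time seen so far; c is a counter that captures
// causality among events sharing the same l. The timestamp size is constant.
public class HlcSketch {
    private long l = 0;   // logical component, tied to physical time
    private long c = 0;   // counter component

    // Stand-in for the node's NTP-synchronized physical clock.
    private long pt() { return System.currentTimeMillis(); }

    // Called on a local event or before sending a message.
    public synchronized long[] sendOrLocal() {
        long lPrev = l;
        l = Math.max(lPrev, pt());
        c = (l == lPrev) ? c + 1 : 0;
        return new long[] { l, c };
    }

    // Called when receiving a message carrying the sender's timestamp (lMsg, cMsg).
    public synchronized long[] receive(long lMsg, long cMsg) {
        long lPrev = l;
        l = Math.max(Math.max(lPrev, lMsg), pt());
        if (l == lPrev && l == lMsg)      c = Math.max(c, cMsg) + 1;
        else if (l == lPrev)              c = c + 1;
        else if (l == lMsg)               c = cMsg + 1;
        else                              c = 0;
        return new long[] { l, c };
    }
}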

To achieve a scalable implementation of Retroscope, we leveraged Apache Ignite for stream processing, computation, and storage. We arranged the log ingestion in a way that minimizes data movement, improves data locality, and achieves maximal parallelism when searching. We had covered these issues in Part 2.
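As a rough illustration of the data-locality idea (a hedged sketch with an assumed cache name "zklog" and a placeholder predicate; not Retroscope's actual code), one can broadcast a search task to every Ignite server and have each server scan only the log entries stored in its own primary partitions, so no data moves across the network during the search.

import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CachePeekMode;

// Sketch: every server scans its locally stored entries in parallel (illustrative only).
public class LocalScanSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Hypothetical cache; in Retroscope the ingested log events would live here.
        ignite.getOrCreateCache("zklog");

        Collection<Long> perServerMatches = ignite.compute().broadcast(() -> {
            IgniteCache<String, Long> local = Ignition.localIgnite().cache("zklog");
            long matches = 0;
            // Scan only the primary partitions stored on this server.
            for (javax.cache.Cache.Entry<String, Long> e : local.localEntries(CachePeekMode.PRIMARY)) {
                if (e.getValue() > 1) matches++;   // stand-in for a real search predicate
            }
            return matches;
        });

        System.out.println("matches per server: " + perServerMatches);
    }
}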

In our prototype, Retroscope deployed on one quad-core server processed over 150,000 consistent cuts per second. Horizontal scalability is one of the strengths of Retroscope's architecture. Adding more compute power allows Retroscope to redistribute the tasks evenly across all available servers and achieve a nearly perfect speedup (93% going from 4 to 8 servers).


Ok, now back to the ZooKeeper case study to demonstrate the advantages of the cut monitoring approach.

The ZooKeeper Case Study 

Users interact with Retroscope via the declarative Retroscope Query Language (RQL). The users only need to specify the nonlocal predicates to search for, and leave the rest for the system to figure out.

To illustrate Retroscope and RQL, we consider replica staleness monitoring in Apache ZooKeeper. In ZooKeeper, a client can read data from any single replica, and if the replica is not fully up-to-date, the client will read stale data. Staleness is a nonlocal property, because it is defined by considering the states of the other replicas at that same point in time. Using a simple RQL query, we can find the cuts that violate the normal (less than 2 versions) staleness behavior of a cluster:
SELECT r1 FROM zklog
WHEN Max(r1) - Min(r1) > 1;
In this query, r1 is the version of a node's state. The system retrospectively looks at past application states and searches for the ones that satisfy this staleness predicate.
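As a rough illustration (not Retroscope's evaluator), the predicate in this query boils down to the following check over the r1 versions of all replicas at one cut:

import java.util.Collection;
import java.util.Collections;

// Sketch of the staleness predicate at a single consistent cut (illustrative only):
// flag the cut if the most and least up-to-date replicas differ by more than one version.
public class StalenessPredicate {
    static boolean violatesStaleness(Collection<Long> r1PerReplica) {
        long max = Collections.max(r1PerReplica);
        long min = Collections.min(r1PerReplica);
        return max - min > 1;   // WHEN Max(r1) - Min(r1) > 1
    }
}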

We observed many cuts having the staleness problem, with a few larger spikes (up to 22 versions stale!) that captured our attention. To investigate the causes of the excessive staleness cases, we need to inspect the message exchange in the system at those points. Here is the query we use for that:
SELECT r1, sentCount, recvCount, diff, staleness
FROM zklog
COMPUTE
GLOBAL diff
AND GLOBAL staleness
AND (staleness := Max(r1) - Min(r1))
AND (diff := NodeSum(sentCount) - NodeSum(recvCount))
AT TIME t1 TO t2

In this query we included another nonlocal property: the number of messages in transit between nodes. The query scans through past cuts around the time of the staleness we observed earlier. This allows us to visualize both staleness and the number of messages in transit between nodes in the cluster. We see that the staleness spikes at the same time as the number of "in-flight" messages increases.
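For illustration, the diff aggregate in this query amounts to the following computation over assumed per-node counters (a sketch, not Retroscope's actual data model):

import java.util.Map;

// Sketch of the "diff" global aggregate (illustrative only): summing each node's
// sent-message and received-message counters across the cut, the difference
// approximates how many messages are still in flight at that instant.
public class InFlightCount {
    static long messagesInFlight(Map<String, Long> sentCount, Map<String, Long> recvCount) {
        long sent = sentCount.values().stream().mapToLong(Long::longValue).sum();
        long recv = recvCount.values().stream().mapToLong(Long::longValue).sum();
        return sent - recv;   // diff := NodeSum(sentCount) - NodeSum(recvCount)
    }
}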

The number of messages "stuck" in the network, however, tells us only a little about the communication patterns in the cluster. To gain more insight into the message exchanges, we look at the in-flight messages more rigorously and examine the sets of sent and received messages at each node with this query:
SELECT sentM, recvM, inFlight, r1, staleness
FROM zklog
COMPUTE
GLOBAL staleness
AND (staleness := Max(r1) - Min(r1))
AND GLOBAL inFlight
AND (inFlight := Flatten(sentM) \ Flatten(recvM))
AT TIME x TO y
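For illustration, the Flatten and set-difference ("\") operations in this query correspond to the following computation over assumed per-node sets of message ids (a sketch, not Retroscope's implementation):

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Sketch of the in-flight message set (illustrative only): union all per-node sets of
// sent message ids, union all per-node sets of received message ids, and the messages
// that were sent but not yet received are the ones in flight at this cut.
public class InFlightMessages {
    static Set<String> inFlight(Collection<Set<String>> sentPerNode,
                                Collection<Set<String>> recvPerNode) {
        Set<String> sent = new HashSet<>();
        sentPerNode.forEach(sent::addAll);       // Flatten(sentM)
        Set<String> recv = new HashSet<>();
        recvPerNode.forEach(recv::addAll);       // Flatten(recvM)
        sent.removeAll(recv);                    // Flatten(sentM) \ Flatten(recvM)
        return sent;
    }
}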

We run this query with a custom query processor that visualizes the results as a "heat-map" of message exchange. Here is an example of how messages were flowing in the system right before and at the peak of the staleness event. The deeper blue color represents a greater number of messages being in the network between nodes. We see more messages in flight in both directions between node #3 (the leader) and node #4, suggesting that staleness is caused by messages being stuck in transit between these nodes for longer than usual. This indicates the possibility of a momentary millibottleneck in the network between node #3 and node #4.

Our Retroscope implementation is available as an open source project on GitHub. We invite you to use the tool and drop us a note about your use cases.
