Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

This paper appeared in OSDI'14, and is authored by Michael Chow, University of Michigan; David Meisner, Facebook, Inc.; Jason Flinn, University of Michigan; Daniel Peek, Facebook, Inc.; Thomas F. Wenisch, University of Michigan.

The goal of this paper is very similar to that of Google Dapper (you can read my summary of Google Dapper here). Both works try to find performance bottlenecks in high-fanout, large-scale Internet services. Both works use similar methods; however, this work (the mystery machine) tries to accomplish the task while relying on less instrumentation than Google Dapper. The novelty of the mystery machine work is that it tries to infer the component call graph implicitly by mining the logs, whereas Google Dapper instrumented each call in a meticulous manner and explicitly obtained the entire call graph.

The motivation for this approach is that comprehensive instrumentation as in Google Dapper requires standardization....and I am quoting the rest from the paper:
[Facebook systems] grow organically over time in a culture that favors innovation over standardization (e.g., "move fast and break things" is a well-known Facebook slogan). There is wide variety in programming languages, communication middleware, execution environments, and scheduling mechanisms. Adding instrumentation retroactively to such an infrastructure is a Herculean task. Further, the end-to-end pipeline includes client software such as Web browsers, and adding detailed instrumentation to all such software is not feasible.

While the paper says it doesn't want to interfere with the instrumentation, of course it has to interfere to establish a minimum standard in the resulting collection of individual software component logs, which they call UberTrace. (Can you find a more Facebooky name than UberTrace? The paper spells it ÜberTrace, but I spare you here.)
UberTrace requires that log messages contain at least:
1. A unique request identifier.
2. The executing computer (e.g., the client or a particular server).
3. A timestamp that uses the local clock of the executing computer.
4. An event name (e.g., "start of DOM rendering").
5. A task name, where a task is defined to be a distributed thread of control.
In order not to incur a lot of overhead, UberTrace samples only a small fraction of all requests to Facebook. But this necessitates another requirement on the logging:
UberTrace must ensure that the individual logging systems choose the same set of requests to monitor; otherwise the probability of all logging systems independently choosing to monitor the same request would be vanishingly small, making it infeasible to build a detailed picture of end-to-end latency. Therefore, UberTrace propagates the decision about whether or not to monitor a request from the initial logging component that makes such a decision through all logging systems along the path of the request, ensuring that the request is completely logged. The decision to log a request is made when the request is received at the Facebook Web server; the decision is included as part of the per-request metadata that is read by all subsequent components. UberTrace uses a global identifier to collect the individual log messages, extracts the data items enumerated above, and stores each message as a record in a relational database.
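
To make the logging requirements concrete, here is a minimal Python sketch of a log record with the five required fields, plus a sampling decision that is made once and then carried in the per-request metadata. All names here (LogRecord, RequestMetadata, admit, log_event) are my own illustration and do not come from the paper or from Facebook's code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    request_id: str   # 1. unique request identifier
    host: str         # 2. executing computer (client or a particular server)
    timestamp: float  # 3. local clock of the executing computer
    event: str        # 4. event name, e.g. "start of DOM rendering"
    task: str         # 5. task name (a distributed thread of control)


@dataclass(frozen=True)
class RequestMetadata:
    request_id: str
    sampled: bool  # decided once, where the request enters the system


def admit(request_id, sample_rate, rng):
    """Make the sampling decision once, at the first component."""
    return RequestMetadata(request_id, rng.random() < sample_rate)


def log_event(meta, host, timestamp, event, task, sink):
    """Downstream components never re-decide; they only read the flag."""
    if meta.sampled:
        sink.append(LogRecord(meta.request_id, host, timestamp, event, task))
```

Because every component consults the same flag, a sampled request is logged everywhere along its path, and an unsampled request is logged nowhere.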

The mystery machine

To infer the call graph from the logs, the mystery machine starts with a call graph hypothesis and refines it gradually as each log trace provides some counterexample. Figure 1 and Figure 2 explain how the mystery machine generates the model via large-scale mining of UberTrace.
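
The hypothesize-then-refute idea can be sketched in a few lines of Python. This is a simplified illustration that considers only a "happens-before" relation between segments; the trace format (a dict from segment name to start and end times) and the function name are assumptions of mine, not the paper's implementation.

```python
from itertools import permutations


def infer_happens_before(traces):
    """traces: list of dicts mapping segment name -> (start, end).

    Start from the hypothesis that every ordered pair (a, b) satisfies
    "a finishes before b starts", then drop each pair that some trace
    contradicts. What survives is the inferred set of dependencies.
    """
    segments = set()
    for trace in traces:
        segments.update(trace)
    hypothesis = set(permutations(sorted(segments), 2))

    for trace in traces:
        for a, b in list(hypothesis):
            if a in trace and b in trace:
                a_end = trace[a][1]
                b_start = trace[b][0]
                if a_end > b_start:
                    # Counterexample: b started before a had finished.
                    hypothesis.discard((a, b))
    return hypothesis
```

Note that a pair which never appears together in any trace is never refuted, so the result is only as good as the trace coverage. That is why the number of traces needed for convergence matters.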


For the analysis in the paper, they use traces of over 1.3 million requests to the Facebook home page gathered over 30 days. Was the sampling rate enough to be statistically meaningful? Figure 3 says yes.

We know that for large-scale Internet services, a single request may invoke 100s of (micro)services, and that many services can lead to 80K-100K relationships as shown in Figure 3. But it was still surprising to see that it took 400K traces for the call graph to start to converge to its final form. That must be one heck of a convoluted spaghetti of services.

Findings

The mystery machine analysis is performed by running parallel Hadoop jobs.

Figure 5 is why critical path identification is important. Check the ratios on the right side.
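
For readers who have not computed a critical path before, here is a minimal Python sketch: given per-segment durations and the inferred dependencies, the critical path is the longest path through the dependency DAG. The segment names and numbers in the usage note are made up for illustration and are not measurements from the paper.

```python
def critical_path(durations, deps):
    """durations: segment -> time; deps: segment -> list of predecessors.

    Returns (total_time, path), where path is the longest chain of
    dependent segments. Assumes the dependency graph is acyclic.
    """
    finish = {}
    best_pred = {}

    def finish_time(seg):
        if seg not in finish:
            start = 0.0
            best_pred[seg] = None
            for pred in deps.get(seg, []):
                t = finish_time(pred)
                if t > start:
                    start, best_pred[seg] = t, pred
            finish[seg] = start + durations[seg]
        return finish[seg]

    # The segment that finishes last ends the critical path.
    last = max(durations, key=finish_time)
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return finish[last], path[::-1]
```

With hypothetical durations {"server": 100, "network": 300, "client_js": 200, "render": 50}, where "network" depends on "server" and "render" depends on both "network" and "client_js", the critical path is server, network, render. Speeding up "client_js" would not improve end-to-end latency at all, which is the kind of insight the ratios in the figure are about.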


How can we use this analysis to improve Facebook's performance?

As Figure 9 showed, some users/requests have "slack" (another technical term this paper introduced). For the users/requests with slack, the server time constitutes only a very small fraction of the critical path, which the network- and client-side latencies dominate.

And there is also a category of users/requests with no slack. For those, the server time dominates the critical path, as the network- and client-side latencies are very low.

This suggests a potential performance improvement by offering differentiated service based on the predicted amount of slack available per connection:
By using predicted slack as a scheduling deadline, we can improve average response time in a manner similar to the earliest deadline first real-time scheduling algorithm. Connections with considerable slack can be given a lower priority without affecting end-to-end latency. However, connections with little slack should see an improvement in end-to-end latency because they are given scheduling priority. Therefore, average latency should improve. We have also shown that prior slack values are a good predictor of future slack [Figure 11]. When new connections are received, historical values can be retrieved and used in scheduling decisions. Since calculating slack is much less complex than servicing the actual Facebook request, it should be feasible to recalculate the slack for each user approximately once per month.
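
A rough Python sketch of what such slack-based scheduling could look like: each request gets a deadline equal to its arrival time plus the user's historical slack, and the server always picks the earliest deadline. The class name, the slack_history lookup table, and the default of zero slack for unknown users are assumptions for illustration; the paper proposes the idea but this is not its implementation.

```python
import heapq


class SlackScheduler:
    """Earliest-deadline-first queue, with deadline = arrival + predicted slack."""

    def __init__(self, slack_history, default_slack=0.0):
        self._history = slack_history  # user_id -> predicted slack
        self._default = default_slack  # unknown users get no slack
        self._heap = []
        self._seq = 0  # tie-breaker that preserves arrival order

    def submit(self, user_id, request, now):
        slack = self._history.get(user_id, self._default)
        heapq.heappush(self._heap, (now + slack, self._seq, request))
        self._seq += 1

    def next_request(self):
        """Pop the request whose deadline comes first."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Treating unknown users as having zero slack is the conservative choice: they are served with priority until some history accumulates.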

Some limitations of the mystery machine

This approach assumes that the call graph is acyclic. With their request-id-based logging, they cannot have the same (event, task) pair appear multiple times in the same request trace.

This approach requires normalizing/synchronizing local clock timestamps across computers. It seems like they are doing offline post-hoc clock synchronization by leveraging the RPC calls. (Does that mean further instrumentation of the RPC calls?)
Since all log timestamps are in relation to local clocks, UberTrace translates them to estimated global clock values by compensating for clock skew. UberTrace looks for the common RPC pattern of communication in which the thread of control in an individual task passes from one computer (called the client to simplify this explanation) to another, executes on the second computer (called the server), and returns to the client. UberTrace calculates the server execution time by subtracting the latest and earliest server timestamps (according to the server's local clock) nested within the client RPC. It then calculates the client-observed execution time by subtracting the client timestamps that immediately succeed and precede the RPC. The difference between the client and server intervals is the estimated network round-trip time (RTT) between the client and server. By assuming that request and response delays are symmetric, UberTrace calculates clock skew such that, after clock-skew adjustment, the first server timestamp in the pattern is exactly 1/2 RTT after the previous client timestamp for the task.
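
The arithmetic in that quote is easy to get lost in, so here is a small Python sketch of it. The function and parameter names are mine, and "skew" is defined here as server clock minus client clock (an assumption about sign convention); the steps follow the description quoted above.

```python
def estimate_skew(client_before, client_after, server_first, server_last):
    """Estimate (rtt, skew) from one client -> server -> client RPC pattern.

    client_before / client_after: client timestamps immediately
        preceding and succeeding the RPC (client's local clock).
    server_first / server_last: earliest and latest server timestamps
        nested inside the RPC (server's local clock).
    """
    server_time = server_last - server_first
    client_time = client_after - client_before
    rtt = client_time - server_time

    # Assume symmetric request/response delay. After adjustment, the first
    # server timestamp must sit exactly rtt/2 after client_before:
    #   server_first - skew == client_before + rtt / 2
    skew = server_first - (client_before + rtt / 2.0)
    return rtt, skew
```

For example, if the client logs 100 and 160 around the call while the server logs 1110 and 1150, the server ran for 40, the client waited 60, so the RTT is 20 and the skew is 1110 - (100 + 10) = 1000. Subtracting 1000 from the server timestamps places them on the client's timeline. If the request and response delays are not actually symmetric, the estimate is off by half the asymmetry.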
This work also did not consider mobile users; 1.19 billion of 1.39 billion users are mobile users.

Related links

Facebook's software architecture

Scaling Memcache at Facebook

Finding a needle in Haystack: Facebook's photo storage

Google Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
