Holistic Configuration Management At Facebook

Move fast and break things

That was Facebook's famous mantra for developers. Facebook believes in getting early feedback and iterating rapidly, so it releases software early and frequently: three times a day for frontend code, three times a day for backend code. I am amazed that Facebook is able to maintain such an agile deployment process at that scale. I have heard of other software companies, even relatively young ones, developing problems with agility, even to the point that deploying a trivial app would take a couple of months due to reviews, filing tickets, routing, permissions, etc.

Of course, when deploying that often at that scale you need discipline and good processes in order to prevent chaos. At the F8 developer conference in 2014, Zuckerberg announced the new Facebook motto "Move Fast With Stable Infra."

I think the configerator tool discussed in this paper is a big part of the "Stable Infra". (By the way, why is it configerator but not configurator? Another Facebook peculiarity like the spelling of Übertrace?)

Here is a link to the paper in the SOSP'15 proceedings.
Here is a link to the conference presentation video.

Configuration management

What is even more surprising than daily Facebook code deployment is this: Facebook's various configurations are changed even more frequently, currently thousands of times a day. And hold fast: every single engineer can make live configuration changes! This is certainly exceptional, especially considering that even a minor mistake could potentially cause a site-wide outage (due to complex interdependencies). How is this possible without incurring chaos?

The answer is: discipline sets you free. By being disciplined about the deployment process, and by having built the configerator, Facebook lowers the risks of deployments and can give its developers the freedom to deploy frequently.

Ok, before reviewing this cool system, configerator, let's get this clarified first: what does configuration management involve and where is it needed? It turns out it is essential for a large and diverse set of systems at Facebook. These include: gating new product features, conducting experiments (A/B tests), performing application-level traffic control, performing topology setup and load balancing at TAO, performing monitoring alerts/remediation, updating machine learning models (which vary from KBs to GBs), and controlling applications' behaviors (e.g., how much memory is reserved for caching, how many writes to batch before writing to the disk, how much data to prefetch on a read).

Essentially, configuration management provides the knobs that enable tuning, adjusting, and controlling Facebook's systems. No wonder configuration changes keep growing in frequency and outpace code changes by orders of magnitude.



Configuration as code approach

The configerator philosophy is to treat configuration as code, that is, to compile and generate configs from high-level source code. Configerator stores both the config programs and the generated configs in git version control.
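To make the idea concrete, here is a minimal sketch of "config as code", not Configerator's actual interface. It assumes a config is the output of a small program and that the compiled form is JSON; the names cache_config and compile_config and all the values are hypothetical.

```python
import json


def cache_config(memory_gb: int) -> dict:
    """Hypothetical config program: derives low-level knobs from one high-level input."""
    return {
        "cache_bytes": memory_gb * 1024**3 // 4,  # reserve a quarter of memory for caching
        "write_batch_size": 64,                   # writes to batch before hitting the disk
        "prefetch_kb": 256,                       # data to prefetch on a read
    }


def compile_config(program, **inputs) -> str:
    """Run the config program and serialize the result deterministically."""
    return json.dumps(program(**inputs), sort_keys=True, indent=2)


generated = compile_config(cache_config, memory_gb=32)
print(generated)
```

In this picture, both the program (cache_config) and its generated output would be checked into version control, so a reviewer can see the intent and the exact bytes that servers will read.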


There can be complex dependencies across systems and services at Facebook: after one subsystem/service config is updated to enable a new feature, the configs of all other systems might need to be updated accordingly. By taking a configuration as code approach, configerator automatically extracts dependencies from source code without the need to manually edit a makefile. Furthermore, Configerator provides many other foundational functions, including version control, authoring, code review, automated canary testing, and config distribution. We will review these next as part of the Configerator architecture discussion.
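Dependency extraction from source can be sketched roughly as follows. This is an illustration under assumptions, not the tool's real mechanism: it assumes config sources are Python modules that import one another, reads the imports with the standard ast module, and walks the reverse dependency graph. The module names in SOURCES are made up.

```python
import ast
from collections import defaultdict

# Hypothetical config sources, keyed by module name.
SOURCES = {
    "app_port": "PORT = 8089",
    "app": "import app_port\nCONFIG = {'port': app_port.PORT}",
    "firewall": "import app_port\nRULES = [app_port.PORT]",
    "dashboard": "import app\nTITLE = 'app'",
}


def imports_of(source: str) -> set[str]:
    """Return the module names that a config source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def affected_by(changed: str, sources: dict[str, str]) -> set[str]:
    """All configs that must be recompiled when `changed` is edited."""
    dependents = defaultdict(set)  # module -> modules that import it
    for name, src in sources.items():
        for dep in imports_of(src):
            dependents[dep].add(name)
    result, stack = set(), [changed]
    while stack:
        for name in dependents[stack.pop()]:
            if name not in result:
                result.add(name)
                stack.append(name)
    return result


print(sorted(affected_by("app_port", SOURCES)))
```

Editing app_port here flags app and firewall directly and dashboard transitively, which is the bookkeeping a hand-maintained makefile would otherwise have to carry.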


While configerator is the main tool, there are other configuration support tools in the suite.
"Gatekeeper controls the rollouts of new product features. Moreover, it can also run A/B testing experiments to find the best config parameters. In addition to Gatekeeper, Facebook has other A/B testing tools built on top of Configerator, but we omit them in this paper due to the space limitation. PackageVessel uses peer-to-peer file transfer to assist the distribution of large configs (e.g., GBs of machine learning models), without sacrificing the consistency guarantee. Sitevars is a shim layer that provides an easy-to-use configuration API for the frontend PHP products. MobileConfig manages mobile apps' configs on Android and iOS, and bridges them to the backend systems such as Configerator and Gatekeeper. MobileConfig is not bridged to Sitevars because Sitevars is for PHP only. MobileConfig is not bridged to PackageVessel because currently there is no need to transfer very large configs to mobile devices."

The P2P file transfer mentioned as part of PackageVessel is none other than BitTorrent. Yes, BitTorrent finds many applications in the datacenter. Here is an example from Twitter in 2010.

The Configerator architecture


Configerator is designed to defend against configuration errors in several phases. "First, the configuration compiler automatically runs developer-provided validators to verify invariants defined for configs. Second, a config change is treated the same as a code change and goes through the same rigorous code review process. Third, a config change that affects the frontend products automatically goes through continuous integration tests in a sandbox. Lastly, the automated canary testing tool rolls out a config change to production in a staged fashion, monitors the system health, and rolls back automatically in case of problems."
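The first phase, developer-provided validators checking invariants at compile time, might look something like the sketch below. The validator names, the invariants, and the config fields are all assumptions chosen for illustration.

```python
from typing import Callable

Validator = Callable[[dict], None]


def port_in_range(cfg: dict) -> None:
    """Hypothetical invariant: the service must listen on an unprivileged port."""
    if not 1024 <= cfg["port"] <= 65535:
        raise ValueError(f"port {cfg['port']} outside 1024..65535")


def batch_is_positive(cfg: dict) -> None:
    """Hypothetical invariant: batching zero writes makes no sense."""
    if cfg["write_batch_size"] <= 0:
        raise ValueError("write_batch_size must be positive")


def validate(cfg: dict, validators: list[Validator]) -> list[str]:
    """Run every validator and collect all error messages, not just the first."""
    errors = []
    for check in validators:
        try:
            check(cfg)
        except ValueError as exc:
            errors.append(str(exc))
    return errors


checks = [port_in_range, batch_is_positive]
good = {"port": 8089, "write_batch_size": 64}
bad = {"port": 80, "write_batch_size": 0}
print(validate(good, checks))
print(validate(bad, checks))
```

A compiler built this way would refuse to emit the generated config whenever the returned list is non-empty, so a bad value never reaches code review, let alone production.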

I think this architecture is actually quite simple, even though it may look complex. Both control and data flow in the same direction: from top to bottom. There are no cyclic dependencies, which can make recovery hard. This is a soft-state architecture: new and correct information pushed from the top cleans out old and bad information.

Canary testing: The proof is in the pudding

The paper has this to say on their canary testing:
"The canary service automatically tests a new config on a subset of production machines that serve live traffic. It complements manual testing and automated integration tests. Manual testing can execute tests that are hard to automate, but may miss config errors due to oversight or shortcut under time pressure. Continuous integration tests in a sandbox can have broad coverage, but may miss config errors due to the small-scale setup or other environment differences. A config is associated with a canary spec that describes how to automate testing the config in production. The spec defines multiple testing phases. For example, in phase 1, test on 20 servers; in phase 2, test in a full cluster with thousands of servers. For each phase, it specifies the testing target servers, the healthcheck metrics, and the predicates that decide whether the test passes or fails. For example, the click-through rate (CTR) collected from the servers using the new config should not be more than x% lower than the CTR collected from the servers still using the old config."
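A canary spec with phases and a CTR predicate could be modeled along these lines. This is a sketch under assumptions, not the paper's spec format: the Phase fields, the 5% threshold standing in for "x%", the 2000-server cluster size, and the fixed CTR numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    num_servers: int
    max_ctr_drop_pct: float  # stands in for the "x%" threshold; the value is an assumption


def phase_passes(phase: Phase, ctr_old: float, ctr_new: float) -> bool:
    """Pass if the canary CTR is not more than max_ctr_drop_pct lower than the control CTR."""
    drop_pct = (ctr_old - ctr_new) / ctr_old * 100
    return drop_pct <= phase.max_ctr_drop_pct


def run_canary(phases, measure) -> str:
    """Walk the phases in order; stop and roll back at the first failure."""
    for phase in phases:
        ctr_old, ctr_new = measure(phase)
        if not phase_passes(phase, ctr_old, ctr_new):
            return f"rollback at {phase.name}"
    return "commit"


SPEC = [Phase("phase 1", 20, 5.0), Phase("phase 2", 2000, 5.0)]


def healthy(phase):
    # Stand-in for real health metrics: (old CTR, new CTR), a 2.5% drop.
    return (0.040, 0.039)


def regressed(phase):
    # Looks fine on 20 servers, but drops 25% at cluster scale.
    return (0.040, 0.039) if phase.num_servers == 20 else (0.040, 0.030)


print(run_canary(SPEC, healthy))
print(run_canary(SPEC, regressed))
```

The second run shows why the phases are staged: a problem that only appears at cluster scale is caught in phase 2 and rolled back before the change is committed.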
Canary testing is an end-to-end test, and it somewhat supersedes trying to build more exhaustive static tests on configs. Of course the validation, review, and sandbox tests are important precautions to make sure the config is sane before it is tried on a small scale in production. However, given that Facebook already has canary testing, it serves as a good final proof of the config's correctness, and this somewhat obviates the need for heavyweight correctness checking mechanisms. The paper gives a couple of examples of problems caught during canary testing.

On the other hand, the paper does not make it clear how conclusive/exhaustive the canary tests are. What if canary tests don't catch slowly manifesting errors, like memory leaks? Also, how does Facebook detect whether there is an abnormality during a canary test? Yes, Facebook has monitoring tools (ubertrace and mystery machine), but are they sufficient for abnormality detection and subtle bug detection? Maybe we don't see an adverse effect of a configuration change on this application, but what if it adversely affects other applications or backend services? It seems that exhaustive monitoring, log collection, and log analysis may need to be done to detect more subtle errors.

Performance of the Configerator

Here are approximate latencies for configerator phases:
"When an engineer saves a config change, it takes about ten minutes to go through automated canary tests. This long testing time is needed in order to reliably determine whether the application is healthy under the new config. After canary tests, how long does it take to commit the change and propagate it to all servers subscribing to the config? This latency can be broken down into three parts: 1) It takes about 5 seconds to commit the change into the shared git repository, because git is slow on a large repository; 2) The git tailer (see Figure 3) takes about 5 seconds to fetch config changes from the shared git repository; 3) The git tailer writes the change to Zeus, which propagates the change to all subscribing servers through a distribution tree. The last step takes about 4.5 seconds to reach hundreds of thousands of servers distributed across multiple continents."
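Adding up the quoted figures gives a feel for where the time goes. The arithmetic below is only a back-of-the-envelope sketch that reads the canary time as roughly ten minutes and treats the three propagation steps as strictly sequential, which is an assumption.

```python
canary_s = 10 * 60    # roughly ten minutes of automated canary testing
commit_s = 5          # commit into the shared git repository
fetch_s = 5           # git tailer fetches the change
distribute_s = 4.5    # Zeus pushes through the distribution tree

propagate_s = commit_s + fetch_s + distribute_s
total_s = canary_s + propagate_s
git_share = (commit_s + fetch_s) / propagate_s  # fraction of propagation spent on git

print(propagate_s, total_s, round(git_share, 2))
```

Under these assumptions, propagation after the canary takes about 14.5 seconds, roughly two thirds of it on the git commit and fetch steps, while the canary phase dominates the end-to-end time.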

This figure from the paper shows that git is the bottleneck for configuration distribution. "The commit throughput is not scalable with respect to the repository size, because the execution time of many git operations increases with the number of files in the repository and the depth of the git history. Configerator is in the process of migration to multiple smaller git repositories that collectively serve a partitioned global name space."

Where is the research?

Configerator is an impressive engineering effort, and I want to focus on the important research takeaways from it. Going forward, what are the core research ideas and findings? How can we push the envelope for future-facing improvements?

How consistent should the configuration rollouts be?
There can be couplings/conflicts between code and configuration. Facebook solves this cleverly. They deploy code first, much earlier than the config, and enable the hidden/latent code later with the config change. There can also be couplings/conflicts between old and new configs. A configuration change arrives at production servers at different times, albeit within 5-10 seconds of each other. Would it cause problems to have some servers run the old configuration and some the new one? Facebook punts this responsibility to the developers: they need to make sure that the new config can coexist peacefully with the old config. After all, they use canary testing, where a fraction of machines use the new config and the rest use the old config. So, in sum, Facebook does not try to provide a strongly consistent reset to the new config. I don't know the details of their system, but for backend servers config changes may need stronger consistency than that.
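The "ship the code dark, enable it later by config" pattern can be sketched as below. The flag name new_ranker_enabled and the render_feed function are hypothetical; the point is only that the code tolerates both the old and the new config during the few seconds when servers disagree.

```python
def render_feed(user_id: int, config: dict) -> str:
    """The new code path ships dark and is switched on by a config flag."""
    # .get with a default keeps servers that still hold the old config working.
    if config.get("new_ranker_enabled", False):
        return f"feed for {user_id} (new ranker)"
    return f"feed for {user_id} (old ranker)"


old_config = {}                             # written before the flag existed
new_config = {"new_ranker_enabled": True}   # rolled out after the code is deployed

print(render_feed(7, old_config))
print(render_feed(7, new_config))
```

Because a missing flag defaults to the old behavior, a server that has not yet received the new config behaves exactly as before, which is the coexistence property the developers are asked to guarantee.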

Push versus pull debate.
The paper claims push is more advantageous than pull in the datacenter for config deployment. I am not convinced, because the arguments do not look strong.
"Configerator uses the push model. How does it compare with the pull model? The biggest advantage of the pull model is its simplicity in implementation, because the server side can be stateless, without storing any hard state about individual clients, e.g., the set of configs needed by each client (note that different machines may run different applications and hence need different configs). However, the pull model is less efficient for two reasons. First, some polls return no new data and hence are pure overhead. It is hard to determine the optimal poll frequency. Second, since the server side is stateless, the client has to include in each poll the full list of configs needed by the client, which is not scalable as the number of configs grows. In our environment, many servers need tens of thousands of configs to run. We opt for the push model in our environment."
This may be worth revisiting and investigating in more detail. Pull is simple and stateless, as they also mention, and it is unclear why it couldn't be adopted.
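The paper's efficiency argument can be put into a toy cost model. This sketch only restates the quoted reasoning; it counts "config names sent over the wire" as a crude unit, and the workload numbers (20,000 configs, one poll every 10 seconds for an hour, 50 changes) are assumptions, not measurements.

```python
def pull_cost(num_configs: int, polls: int) -> int:
    """Stateless pull: every poll carries the full list of config names."""
    return num_configs * polls


def push_cost(num_configs: int, changes: int) -> int:
    """Push: one subscription per config up front, then one message per change."""
    return num_configs + changes


# Assumed workload for one server over an hour.
configs, polls, changes = 20_000, 360, 50
print(pull_cost(configs, polls), push_cost(configs, changes))
```

The gap in this model comes entirely from resending the full list on every poll. A pull design that avoided that, which the quoted passage does not consider, would not be captured by these two functions, so the model illustrates the paper's claim rather than settling the debate.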

How do we extend to WAN?
All coordination mentioned is single master (i.e., single producer/writer). Would there be a need for a multi-master solution, with a master at each region/continent that can start a config update? The system would then need to deal with concurrent and potentially conflicting configuration changes. However, given that canary testing is on the order of minutes, there would not be a practical need for multi-master deployment in the near future.

Reviews of other Facebook papers

Facebook's Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

Facebook's software architecture

Scaling Memcache at Facebook

Finding a Needle in Haystack: Facebook's Photo Storage
