Paper Review. Sharding the Shards: Managing Datastore Locality at Scale with Akkio

This paper by Facebook, which appeared in OSDI'18, describes Akkio, a data locality management service. Akkio has been in production use at Facebook since 2014. It manages over 100PB of data, and processes over 10 million data accesses per second.

Why do we need to manage locality?

Replicating all data to all datacenters is hard to justify economically (due to the extra storage and WAN networking costs) when acceptable durability and request serving latency could be achieved with 3 replicas. It looks like Facebook had been doing full replication (at least for the ViewState and AccessState applications discussed in the evaluation) to all 6 datacenters back in the day, but as the workload and the number of datacenters grew, this became untenable.

So, let's find suitable home-bases for data, instead of fully replicating it to all datacenters. But the problem is that access locality is not static. What was a good location/configuration for the data ceases to be suitable when the data access patterns change. A change in the requesting datacenter can arise because the user travels from one region to another. But, meh, that is old hat, and not very interesting. The paper mentions that the more likely reason for request movement is that service workloads are shifted from datacenters with high loads to others with lower loads in order to lower service request response latencies. A diurnal peaking pattern is evident in the graphs. During peaks, up to 50% of requests may be shifted to remote datacenters.


Why are µ-shards the right abstraction for managing locality?

The paper advocates for µ-shards (micro-shards), very fine grained datasets (from 1KB to 1MB), to serve as the unit of migration and the abstraction for managing locality. A µ-shard contains multiple key-value pairs or database table rows, and should be chosen such that it exhibits strong access locality. Examples could be Facebook viewing history to inform subsequent content, user profile information, Instagram messaging queues, etc.

Why not shards, but µ-shards? Shards are for datastores, µ-shards are for applications. Shard sizes are set by administrators to balance shard overhead, load balancing, and failure recovery, and they tend to be on the order of gigabytes. Since µ-shards are formed by the client application to correspond to a working set, they capture access locality better. They are also more amenable to migration: µ-shard migration has an overhead that is many orders of magnitude lower than that of shard migration, and its utility is far higher. There is no need to migrate a 1GB partition when the access is to a 1MB portion.

Enter Akkio

To address these issues, the paper introduces Akkio, a µ-shard based locality management service for distributed datastores. Akkio is layered between client applications and the distributed datastore systems that implement sharding. It decides in which datacenter to place data, and how and when to migrate it, in order to reduce access latencies and WAN communication. It helps direct each data access to where the target data is located, and it tracks each access to be able to make appropriate placement decisions.

In the rest of this review, we discuss how Akkio leverages the underlying datastore to build a locality management layer, the architecture and components of that management layer, and the evaluation results from the deployment of Akkio at Facebook. As always, we have a MAD questions section at the end to speculate and roam free.

Underlying datastore & shard management

The paper discusses ZippyDB as an underlying datastore that Akkio layers upon. (Akkio also runs over Cassandra, and three other internally developed databases at Facebook.)

ZippyDB manages shards as follows. Each shard may be configured to have multiple replicas, with one designated to be the primary and the others referred to as secondaries. A write to a shard is directed to its primary replica, which then replicates the write to the secondary replicas using Paxos to ensure that writes are processed in the same order at each replica. Reads that need to be strongly consistent are directed to the primary replica. If eventual consistency is acceptable, then reads can be directed to a secondary.
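Here is a minimal sketch of that routing logic, assuming a toy Shard class with made-up field names; it only illustrates how the requested consistency level determines which replica serves a request, not ZippyDB's actual API.

```python
from enum import Enum

class Consistency(Enum):
    STRONG = 1
    EVENTUAL = 2

class Shard:
    """Toy model of a ZippyDB-style shard: one primary, N secondaries."""
    def __init__(self, primary, secondaries):
        self.primary = primary          # replica that sequences writes via Paxos
        self.secondaries = secondaries  # replicas that may serve slightly stale reads

    def route_write(self):
        # All writes go to the primary, which replicates them in order.
        return self.primary

    def route_read(self, consistency):
        # Strongly consistent reads must go to the primary.
        if consistency is Consistency.STRONG:
            return self.primary
        # Eventually consistent reads can be served by a secondary.
        return self.secondaries[0] if self.secondaries else self.primary
```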

A shard's replication configuration identifies the number of replicas of the shard and how the replicas are distributed over datacenters, clusters, and racks. A service may specify that it requires 3 replicas, with 2 replicas (representing a quorum) in one datacenter for improved write latencies and a third in a different datacenter for durability.

A replica set collection refers to the group of all replica sets that have the same replication configuration. Each such collection is assigned a unique id, called a location handle. Akkio leverages replica set collections as housing for µ-shards.
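The following toy sketch captures these concepts under assumed names (the ReplicationConfig class, the handles, and the datacenter letters are made up for illustration): a location handle names a replica set collection, and Akkio's job reduces to maintaining and updating a µ-shard to location-handle mapping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplicationConfig:
    # Where a replica set collection keeps its replicas.
    primary_datacenter: str
    secondary_datacenters: tuple  # e.g., ("A", "B"): quorum with the primary, plus a remote copy

# location handle -> replication configuration (hypothetical layouts)
replica_set_collections = {
    101: ReplicationConfig("A", ("A", "B")),
    102: ReplicationConfig("C", ("D", "E")),
}

# Akkio's essential state: which replica set collection houses each µ-shard.
ushard_location = {"viewstate:12345": 102}

def migrate(ushard_id: str, new_handle: int) -> None:
    # Conceptually, migration moves the µ-shard's data and flips this mapping
    # (the actual migration protocol is sketched later in the review).
    ushard_location[ushard_id] = new_handle
```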


This figure depicts several shard replica sets and a number of µ-shards within the replica sets. You can see datacenters A, B, and C in the figure, and you can imagine it going all the way to datacenter Z if you like. The replica set collection with location handle 1 has a primary at datacenter A and two secondaries it replicates to in datacenter B. Replica set collection 345 has a primary in A and secondaries in B and C.

x is a µ-shard originally located in replica set collection 78, which has a primary in datacenter C, and maybe secondaries in D and E. When access patterns change so that datacenters A and B become a better location for µ-shard x, Akkio relocates x from replica set collection 78 to replica set collection 1. That means that from now on, the writes for x are forwarded to datacenter A, and are automatically replicated to the secondaries of replica set collection 1 at B and C.

A helpful analogy is to think of replica set collections as condo buildings, which offer different locations and floor plan configurations. You can think of the location handle for a replica set collection as the address of the condo building, and the µ-shard as a tenant in the condo building. When a µ-shard needs to migrate, Akkio relocates it to a different condo building with a suitable location and configuration.

Akkio Design and Architecture

I once heard that "All problems in computer science can be solved by another level of indirection". Akkio is built upon this next-level-of-indirection idea. Akkio maps µ-shards onto shard replica set collections, whose shards are in turn mapped to datastore storage servers. This is worth repeating, because this is the fundamental idea in Akkio. When running on top of ZippyDB, Akkio places µ-shards on, and migrates µ-shards between, different such replica set collections. This allows Akkio to piggyback on ZippyDB functionality to provide replication, consistency, and intra-cluster load balancing.
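To make the indirection concrete, here is a toy resolution chain with made-up maps and identifiers; Akkio owns the first mapping (µ-shard to location handle), while the underlying datastore owns the second (location handle to storage servers).

```python
# Akkio's mapping: µ-shard-id -> location handle (a replica set collection id)
akkio_location_db = {
    "viewstate:12345": 101,
}

# The datastore's own sharding: location handle -> storage servers per datacenter
datastore_shard_map = {
    101: {"A": "storage-a-17", "B": "storage-b-03"},
}

def resolve(ushard_id: str) -> dict:
    """Resolve a µ-shard access down to concrete storage servers."""
    handle = akkio_location_db[ushard_id]   # Akkio's level of indirection
    return datastore_shard_map[handle]      # the datastore's shard placement

print(resolve("viewstate:12345"))  # {'A': 'storage-a-17', 'B': 'storage-b-03'}
```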


Above is an architectural overview of Akkio. The client application here is ViewState (which we discuss in the evaluation). The client application is responsible for partitioning data into µ-shards such that they exhibit access locality. The client application must establish its own µ-shard-id scheme that identifies its µ-shards, and specify the µ-shard the data belongs to in the call to the database client library every time it needs to access data in the underlying database.
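A hypothetical µ-shard-id scheme for a ViewState-like application might look like the sketch below: all of a user's history shares one µ-shard id, and that id is passed on every datastore call. The db_client object and its put() method are assumptions for illustration, not the real client library.

```python
import time

def viewstate_ushard_id(user_id: int) -> str:
    # All of a user's ViewState entries share one µ-shard id, so the whole
    # working set can be placed and migrated as a unit.
    return f"viewstate:{user_id}"

def append_viewstate(db_client, user_id: int, snapshot: bytes) -> None:
    ushard = viewstate_ushard_id(user_id)
    # The µ-shard id is passed on every access so the datastore client
    # library can ask Akkio where this µ-shard currently lives.
    db_client.put(ushard_id=ushard,
                  key=f"{ushard}:{int(time.time() * 1000)}",
                  value=snapshot)
```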

Harkening back to our analogy above, the µ-shard-id corresponds to the name of the tenant. If you mention that name to Akkio, it knows which condo building the tenant currently lives in and forwards your message. Akkio also acts as the relocation service for the tenants; as the needs of tenants change, Akkio relocates them to different condo buildings, and keeps track of this.

Akkio manages all of this functionality transparently to the application using three components: the Akkio location service, the access counter service, and the data placement service.

Akkio location service (ALS)

ALS maintains a location database. The location database is used on each data access to look up the location of the target µ-shard. The ZippyDB client library makes a call to the Akkio client library's getLocation(µ-shard-id) function, which returns a ZippyDB location handle (representing a replica set collection) obtained from the location database. This location handle (the condo building address in our analogy) enables ZippyDB's client library to direct the access request to the appropriate storage server. The location database is updated when a µ-shard is migrated.

The location information is configured to have an eventually consistent replica at every datacenter to ensure low read latencies and high availability, with the primary replicas evenly distributed across all datacenters. The eventual consistency is justified because the database has a high read-write ratio (> 500). Moreover, distributed in-memory caches are employed at every datacenter to cache the location information and reduce the read load on the database. Note that stale information is not a big problem for ALS, because a µ-shard that turns out to be missing at the returned location handle will lead to the client application making another ALS query, which is more likely to return fresh location information.
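A minimal sketch of this lookup path is below, assuming hypothetical location_cache, location_db, and datastore objects rather than Akkio's actual API: the local cache is consulted first, and a stale location handle is detected when the µ-shard is missing at the replica set collection it points to.

```python
def get_location(ushard_id: str, location_cache, location_db) -> int:
    """Return the location handle for a µ-shard, preferring the local cache."""
    handle = location_cache.get(ushard_id)
    if handle is None:
        handle = location_db.read(ushard_id)   # eventually consistent local replica
        location_cache.put(ushard_id, handle)
    return handle

def read_ushard(ushard_id: str, key, location_cache, location_db, datastore):
    handle = get_location(ushard_id, location_cache, location_db)
    try:
        return datastore.read(handle, key)
    except KeyError:
        # Stale location: the µ-shard no longer lives in that replica set
        # collection, so invalidate the cache and re-query the location database.
        location_cache.invalidate(ushard_id)
        handle = get_location(ushard_id, location_cache, location_db)
        return datastore.read(handle, key)
```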

Access counter service (ACS)

Each time the client service accesses a µ-shard, the Akkio client library requests the ACS to record the access, the type of access, and the location from which the access was made, so that Akkio collects statistics for making µ-shard placement and migration decisions. This request is issued asynchronously so that it is not on the critical path.
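Keeping the counting off the critical path could look like the sketch below, where a background thread drains a queue and forwards access records to an assumed acs_client stub; the real ACS additionally maintains time-windowed, per-datacenter counters.

```python
import queue
import threading

access_log = queue.Queue()

def record_access_async(ushard_id: str, access_type: str, datacenter: str) -> None:
    # Enqueue and return immediately: counting stays off the critical path.
    access_log.put((ushard_id, access_type, datacenter))

def acs_drain_loop(acs_client) -> None:
    # Background worker that forwards buffered access records to the ACS.
    while True:
        ushard_id, access_type, datacenter = access_log.get()
        acs_client.record(ushard_id, access_type, datacenter)

# Example wiring (acs_client is an assumed stub):
# threading.Thread(target=acs_drain_loop, args=(acs_client,), daemon=True).start()
```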

Data placement service (DPS)

The DPS initiates and manages µ-shard migrations. The Akkio client library asynchronously notifies the DPS that a µ-shard placement may be suboptimal whenever a data access request needs to be directed to a remote datacenter. The DPS re-evaluates the placement of a µ-shard only when it receives such a notification, in a reactive fashion. The DPS maintains historical migration data, e.g., the time of the last migration, to limit migration frequency (to prevent the ping-ponging of µ-shards).

To decide on the optimal µ-shard placement, the DPS uses the following algorithm. First, by querying the ACS, the DPS computes a per-datacenter score by summing the number of times the µ-shard was accessed from that datacenter over the last X days (where X is configurable), weighting more recent accesses more strongly. The per-datacenter scores for the datacenters on which a replica set collection has replicas are then summed to generate a replica set collection score. If there is a clear winner, the DPS picks that winner. Otherwise, among the suitable replica set collection candidates, the DPS calculates another score using resource usage data, and goes with the highest.
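A sketch of that scoring logic follows; the decay factor, the shape of the access counts, and the tie-breaking resource_score function are placeholders, and a "clear winner" is approximated here as a unique maximum.

```python
def datacenter_scores(access_counts_by_day: dict, decay: float = 0.9) -> dict:
    """access_counts_by_day: {datacenter: [count_today, count_yesterday, ...]}"""
    return {
        dc: sum(count * (decay ** age) for age, count in enumerate(counts))
        for dc, counts in access_counts_by_day.items()
    }

def pick_collection(candidates: dict, dc_scores: dict, resource_score) -> int:
    """candidates: {location_handle: [datacenters holding its replicas]}"""
    collection_scores = {
        handle: sum(dc_scores.get(dc, 0) for dc in dcs)
        for handle, dcs in candidates.items()
    }
    best = max(collection_scores.values())
    top = [h for h, s in collection_scores.items() if s == best]
    if len(top) == 1:
        return top[0]                       # clear winner on access locality
    return max(top, key=resource_score)     # otherwise break the tie on resource usage

# Example: datacenter A dominates recent accesses, so collection 101 wins.
scores = datacenter_scores({"A": [120, 80], "B": [30, 10]})
print(pick_collection({101: ["A", "B"], 102: ["B"]}, scores,
                      resource_score=lambda handle: 0))
```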

After the optimal location is identified, the migration is performed in a relatively straightforward manner. If the underlying datastore (such as ZippyDB) supports access control lists (ACLs), the source µ-shard is restricted to be read-only during the migration. If the datastore does not support ACLs (such as the Cassandra implementation used at Facebook), a slightly more involved migration mechanism is employed.
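Under the ACL path, a migration could look roughly like the sketch below; the helper calls are hypothetical, error handling is omitted, and the non-ACL (Cassandra) variant needs extra machinery to handle writes that race with the copy.

```python
def migrate_ushard(ushard_id: str, src_handle: int, dst_handle: int,
                   datastore, location_db) -> None:
    datastore.set_read_only(src_handle, ushard_id)     # freeze writes to the source via ACL
    data = datastore.read_all(src_handle, ushard_id)   # copy the µ-shard out of the source
    datastore.write_all(dst_handle, ushard_id, data)   # write it into the destination collection
    location_db.update(ushard_id, dst_handle)          # flip the µ-shard's location mapping
    datastore.delete_all(src_handle, ushard_id)        # remove the source copy
    datastore.clear_read_only(src_handle, ushard_id)   # drop the temporary ACL restriction
```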


One thing I notice here is that, while the replica set collection is a nice abstraction, it leads to some inefficiencies for certain migration patterns. What if we just wanted to swap the primary replica to the location of a secondary replica? (This is a plausible scenario, because the write region may have changed to the location of a secondary replica.) If we were working at a lower layer of abstraction, this would be as simple as changing the leader in the Paxos replication group. But since we work on top of the replica set collection abstraction, this requires a full-fledged migration (following the procedure above) to a different replica set collection where the locations of the primary and the secondary replica are reversed.

Evaluation

The paper gives an evaluation of Akkio using four applications used at Facebook, of which I will only cover the first two.

ViewState

ViewState stores a history of content previously shown to a user. Each time a user is shown some content, an additional snapshot is appended to the ViewState data. ViewState data is read on the critical path when displaying content, so minimizing read latencies is important. The data needs to be replicated 3-ways for durability. Strong consistency is a requirement.

Wait... Strong consistency is a requirement?? I am surprised strong consistency is required to show the user their feed. I guess this is to improve the user experience by not re-displaying something the user has already seen.

Originally, ViewState data was fully replicated across 6 datacenters. (I presume that was all the datacenters Facebook had back in the day.) Using Akkio with the setup described above led to a 40% smaller storage footprint, a 50% reduction of cross-datacenter traffic, and about a 60% reduction in read and write latencies compared to the original non-Akkio setup. Each remote access notifies the DPS, resulting in about 20,000 migrations a second. Using Akkio, roughly 5% of ViewState reads and writes go to a remote datacenter.

AccessState

The second application is AccessState. AccessState stores information about user actions taken in response to content displayed to the user. The data includes the action taken, what content it was related to, a timestamp of when the action was taken, etc. The data needs to be replicated 3 ways but only needs to be eventually consistent.

Using Akkio with the setup described above led to a 40% decrease in storage footprint, a roughly 50% reduction of cross-datacenter traffic, a negligible increase in read latency (0.4%), and a 60% reduction in write latency. Roughly 0.4% of the reads go remote, resulting in about a thousand migrations a second.


In the figure, ViewState is at the top and AccessState at the bottom. The figure shows the percentage of accesses to remote data, the number of evaluatePlacement() calls to the DPS per second, and the number of ensuing µ-shard migrations per second. For ViewState, the number of calls to the DPS per second and the number of migrations per second are the same.

MAD questions

1. What are some future applications for Akkio?
The paper mentions the following.
Akkio can be used to migrate µ-shards between cold storage media (e.g., HDDs) and hot storage media (e.g., SSDs) on changes in data temperatures, similar in spirit to CockroachDB's archival partitioning support. For public cloud solutions, Akkio could migrate µ-shards when shifting application workloads from one cloud provider to another cloud provider that is operationally less expensive. When resharding is required, Akkio could migrate µ-shards, on first access, to newly instantiated shards, allowing a more gentle, incremental form of resharding in situations where many new nodes come online simultaneously.
I had written about the need for data-aware distributed systems/protocols in 2015.
This trend may indicate that distributed algorithms should adapt to the data they operate on to improve performance. So, we may see the adoption of machine learning as input/feedback to the algorithms, and the algorithms becoming data-driven and data-aware. (For example, this could be a good way to attack the tail-latency problem discussed here.)
Similarly, driven by the demand from large-scale cloud computing services, we may see power-management, energy-efficiency, and electricity-cost-efficiency as requirements for distributed algorithms. Big players already partition data as hot, warm, and cold, and employ tricks to reduce power. We may see algorithms becoming more aware of this.
I think the trend is still strong, and we will see more work on data-aware distributed systems/protocols in the coming years.


2. Is it possible to support transactions for µ-shards?
Akkio does not currently support inter-µ-shard transactions, unless they are implemented purely client-side. The paper mentions that providing this support is left for future work.

In our work on wPaxos, we not only showed policies for migration of µ-shards, but also implemented transactions on µ-shards.

We have recently presented a more detailed study on access locality and migration policies, and we hope to expand on that work in the context of Azure Cosmos DB.


3. How does this compare with other datastores?
Many applications already group together data by prefixing keys with a common identifier to ensure that related data are assigned to the same shard. FoundationDB is an example, although they don't have much of a geolocation and location management story yet. Some databases, like Cassandra, support the concept of separate partition keys, but you need to deal with locality management yourself as the application developer. Spanner supports directories and a move-dir command, although Spanner may shard directories into multiple fragments. CockroachDB uses small partitions that should be amenable to migration. Migration can be done by relocating the Raft replica set responsible for a partition to destination datacenters gradually by leveraging Raft reconfiguration. But I don't think they have a locality management layer yet.


4. What are the limitations to locality management?
As the paper mentions, some data is not suitable for µ-sharding because it cannot be broken into self-contained small parts without references to other entities; examples include Facebook's social graph data and Google's search graph data.

When there are multiple writers around the globe, the locality management technique, and specifically Akkio's implementation leveraging datastores that use a primary and secondaries, will fall short. Also, when there are reads from many regions and request serving latency is to be minimized, efficient and streamlined full replication is a better choice.

Cosmos DB, Azure's cloud-native database service, offers frictionless global distribution across any number of the 50+ Azure regions you choose to deploy it on. It enables you to elastically scale throughput and storage worldwide on demand, and you pay only for what you provision. It guarantees single-digit-millisecond latencies at the 99th percentile, supports multiple read/write regions around the globe and multiple consistency models, and is backed by comprehensive service level agreements (SLAs).


5. Micro is the new black
µ-services, µ-shards, and even FaaS. There is an increasingly strong trend toward going micro. I think the reason for this is that finer granularity makes for a more flexible, adaptable, agile distributed system.
