Replex: A Scalable, Highly Available Multi-Index Data Store

This paper received the best paper award at Usenix ATC'16 last week. It considers a timely and important problem. With NoSQL databases, we got scalability, availability, and performance, but we lost secondary keys. How do we put the secondary indices back without compromising scalability, availability, and performance?

The paper mentions that previous work on HyperDex did a good job of re-introducing secondary keys to NoSQL, but with overhead: HyperDex generates and partitions an additional copy of the datastore for each key. This introduces overhead for both storage and performance: supporting just one secondary key doubles storage requirements and write latencies.

Replex adds secondary keys to NoSQL databases without that overhead. The key insight of Replex is to combine the need to replicate for fault-tolerance with the need to replicate for index availability. After replication, Replex has both replicated and indexed a row, so there is no need for explicit indexing.

How does Replex work?

All replexes store the same data (every row in the table); the only difference across replexes is the way data is partitioned and sorted, which is by the sorting key of the index associated with the replex. Each replex is associated with a sharding function, h, such that h(r) defines the partition number in the replex that stores row r.
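To make this concrete, here is a minimal sketch (in Python, not the paper's code) of a table stored as two replexes. The Replex class, the hash-based sharding function, and the column names are assumptions made for illustration; the point is that a single write lands in one partition of every replex, so replication doubles as indexing.

```python
class Replex:
    """One copy of the table, partitioned by one index key."""
    def __init__(self, index_key, num_partitions):
        self.index_key = index_key          # column this replex partitions by
        self.num_partitions = num_partitions
        self.partitions = [[] for _ in range(num_partitions)]

    def h(self, row):
        # sharding function h: maps a row to a partition number in this replex
        return hash(row[self.index_key]) % self.num_partitions

    def insert(self, row):
        self.partitions[self.h(row)].append(row)

    def lookup(self, value):
        # an index lookup touches exactly one partition of this replex
        p = hash(value) % self.num_partitions
        return [r for r in self.partitions[p] if r[self.index_key] == value]

# Two replexes over the same table: one partitioned by 'id', one by 'email'.
replexes = [Replex("id", 4), Replex("email", 4)]

def write(row):
    for replex in replexes:
        replex.insert(row)   # replicating the row also indexes it

write({"id": 42, "email": "alice@example.com", "name": "Alice"})
print(replexes[1].lookup("alice@example.com"))
```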

So, that was easy. But there is an additional complication to deal with. The difficulty arises because individual replexes can have requirements, such as uniqueness constraints, that cause the same operation to be valid in one replex and invalid in another. Figure 2 gives an example scenario: a linearizability requirement for a distributed log.

To deal with this problem, datastores with global secondary indexes need to employ a distributed transaction for update operations, because an operation must be atomically replicated as valid or invalid across all the indexes. But using a distributed transaction for every update operation would cripple system throughput.

To remove the need for a distributed transaction in the replication protocol, they modify chain replication to include a consensus protocol. Figure 3 illustrates this solution. When the consensus phase (going to the right in Figure 3) reaches the last partition in the chain, the last partition aggregates each partition's decision into a final decision, which is simply the logical AND of all decisions. Then comes the replication phase, where the last partition initiates the propagation of this final decision back up the chain. As each partition receives this final decision, if the decision is to abort, the partition discards that operation. If the decision is to commit, that partition commits the operation to disk and continues propagating the decision.
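Here is a minimal sketch of those two phases under simplifying assumptions (single-threaded, in-memory, no failures). The Partition class, the uniqueness-constraint check, and chain_insert are illustrative names, not the paper's implementation.

```python
class Partition:
    """One partition in the chain; enforces at most one local constraint."""
    def __init__(self, name, unique_key=None):
        self.name = name
        self.unique_key = unique_key   # e.g., a column with a uniqueness constraint
        self.store = []

    def is_valid(self, op):
        # local decision: reject the op if it violates this partition's constraint
        if self.unique_key is None:
            return True
        return all(row[self.unique_key] != op[self.unique_key] for row in self.store)

    def commit(self, op):
        self.store.append(op)          # stands in for committing to disk

def chain_insert(chain, op):
    # Consensus phase: the op flows down the chain, each partition records its
    # local valid/invalid decision, and the last partition ANDs them together.
    decisions = [p.is_valid(op) for p in chain]
    final_decision = all(decisions)

    # Replication phase: the final decision flows back up the chain; on commit
    # every partition applies the op, on abort every partition discards it.
    if final_decision:
        for p in reversed(chain):
            p.commit(op)
    return final_decision

chain = [Partition("by-id", unique_key="id"), Partition("by-email", unique_key="email")]
print(chain_insert(chain, {"id": 1, "email": "a@x.com"}))  # True: committed everywhere
print(chain_insert(chain, {"id": 2, "email": "a@x.com"}))  # False: aborted everywhere
```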

This has similarities to the CRAQ protocol for chain replication. Linked is an earlier post that contains a summary of chain replication and the CRAQ protocol.

Fault-tolerance

There is additional complexity due to failure of the replicas. Failed partitions bring up two concerns: how to reconstruct the failed partition, and how to respond to queries that would have been serviced by the failed partition.

If a partition fails, a simple recovery protocol would redirect queries originally destined for the failed partition to the other replex. Then the failure amplification is maximal: the read must now be broadcast to every partition in the other replex, and at each partition the read becomes a brute-force search that must iterate through the entire local storage of that partition.
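A small sketch of that amplification, again with illustrative names and a toy in-memory layout (each replex is just a list of partitions, each partition a list of row dicts): when the partition that indexes by email is down, the read falls back to broadcasting over every partition of the id-sharded replex and scanning each one.

```python
def shard(value, num_partitions):
    # stand-in sharding function for both replexes
    return hash(value) % num_partitions

def read_by_email(email, email_replex, id_replex, failed):
    p = shard(email, len(email_replex))
    if p not in failed:
        # healthy case: the lookup touches exactly one partition
        return [r for r in email_replex[p] if r["email"] == email]
    # failure case: redirect to the replex sharded by 'id'. It is not organized
    # by email, so the read is broadcast to every partition and each partition
    # brute-force scans its entire local storage.
    return [r for part in id_replex for r in part if r["email"] == email]

# Example: 4 partitions per replex, one row, and the partition that would have
# served the email lookup has failed.
row = {"id": 7, "email": "bob@example.com"}
email_replex = [[] for _ in range(4)]
id_replex = [[] for _ in range(4)]
email_replex[shard(row["email"], 4)].append(row)
id_replex[shard(row["id"], 4)].append(row)
print(read_by_email("bob@example.com", email_replex, id_replex,
                    failed={shard("bob@example.com", 4)}))
```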

On the other hand, to avoid failure amplification within a failure threshold f, one could introduce f replexes with the same sharding function, h, as exact replicas. There is no failure amplification within the failure threshold, because sharding is identical across exact replicas. But the cost is storage and network overhead in the steady state.

This is the tradeoff, and the paper dedicates "Section 3: Hybrid Replexes" to exploring this tradeoff space.

Concluding remarks

The paper compares Replex to HyperDex and Cassandra and shows that Replex's steady-state performance is 76% better than HyperDex and on par with Cassandra for writes. For reads, Replex outperforms Cassandra by as much as 2-9x while maintaining performance equivalent to HyperDex. In addition, the paper shows that Replex can recover from one or two failures 2-3x faster than HyperDex.

Replex solves an important problem with less overhead than previous solutions. The hybrid replexes method (explained in Section 3) can also be useful in other problems for preventing failure amplification.
