Using Lightweight Formal Methods To Validate A Key-Value Storage Node Inward Amazon S3 (Sosp21)
This paper comes from my colleagues at AWS S3 Automated Reasoning Group, detailing their sense applying lightweight formal methods to a novel degree of storage node developed for S3 storage backend. Lightweight formal methods emphasize automation in addition to usability. In this case, the approach involves 3 prongs:
- developing executable reference models every bit specifications,
- checking implementation conformance to those models, in addition to
- building infrastructure to ensure the models stay accurate inward the future.
ShardStore
ShardStore is a novel append-only key-value storage node developed for AWS S3 backend. It is over 40K lines of Rust code. Shardstore is a log-structured merge tree (LSM tree) but with shard information stored exterior the tree to cut back write amplification.
ShardStore employs soft updates for avoiding the toll of redirecting writes through a write-ahead log spell yet beingness crash consistent. A soft updates implementation ensures that only writes whose dependencies are persisted are sent to disk so that whatever crash acre of the disk is consistent.
Crash consistency is of import for Shardstore, non due to durability issues ---there are already multiple replicas for ensuring 11 9s durability. Instead crash consistency is of import for reducing the toll in addition to operational touching on of node failures. A crash that loses an entire storage host's information creates lots of repair network traffic in addition to IO charge on potentially dozens of other storage nodes. Crash consistency also ensures that the storage node recovers to a condom acre after a crash so it becomes easier, faster, cheaper for it to larn dorsum online.
Validating ShardStore
For each ShardStore component, the squad developed a reference model --an executable specification inward Rust that implements the same interface every bit the actual component, but using a much simpler implementation. For instance, for the Index ingredient that maps shard identifiers to chunk locators, the ReferenceIndex uses a uncomplicated hash tabular array to shop the mapping, rather than the persistent LSM-tree implementation.
Unit tests at ShardStore's API layer usage ReferenceIndex every bit a mock implementation of the Index component, rather than instantiating the existent implementation. This increases the likelihood that the models volition stay accurate, every bit writing regular unit of measurement tests for novel functionality added to the organization oft requires updating the mock. The squad found that yesteryear writing reference models inward the same linguistic communication every bit the implementation, it became easier for the engineers to continue the models upwards to date.
For validating ShardStore, the squad decomposed the durability belongings into 3 parts, in addition to reasoned virtually each separately.
- For sequential crash-free executions, conformance to the reference model is checked straight nether property-based testing.
- For sequential crashing executions, the reference model is refined to found which information tin live on lost after a crash, in addition to that is used for checking the conformance of the implementation nether property-based testing (proptesting).
- For concurrent crash-free executions, form out reference models are written in addition to checked nether model-checking.
Checking properties of concurrent crashing executions using an automated approach is left every bit hereafter work.
Conformance checking of crash-free sequential executions
Property-based testing is used for checking that the implementation code refines the reference model. Proptesting tin live on idea of every bit an extension of fuzzing with user-provided correctness properties in addition to structured inputs, which permit it to cheque richer behaviors than fuzzing alone.
The property-based tests are designed to direct hold every bit input a sequence of operations drawn from an alphabet the squad defined. For each functioning inward the sequence, the assay instance applies the functioning to both reference model in addition to implementation, compares the output of each for equivalence, in addition to and so checks invariants that relate the 2 systems.
The squad applied techniques similar failure injection in addition to biasing arguments towards potentially problematic corner cases to improve the acre coverage of random property-based testing in addition to hence increment the likelihood of detecting issues.
Checking Crash Consistency
Two crash consistency properties were defined inward damage of the user-space dependencies in addition to checked for conformance nether property-testing:
- persistence: if a dependency says an functioning has persisted earlier a crash, it should live on readable after a crash (or if multiple operations direct hold persisted, the most recent 1 to persist should live on readable)
- forward progress: after a non-crashing shutdown, the dependency for every write should signal it is persistent
It was really surprising to come across that this was able to grab a really subtle põrnikas (issue #10 inward Figure 5) inward an automated fashion. This was a põrnikas that involved a special selection of random UUID to collide with the magic bytes, a chunk that was the right size to only barely tumble onto a instant page, in addition to a crash that lost only the instant page
Checking concurrent executions
Since measure proptesting is non suitable for exploring concurrent executions, to cheque concurrent properties, the squad wrote harnesses for telephone commutation properties in addition to validated them using stateless model checking, which explored all concurrent interleavings of a program. They usage Loom to soundly cheque all interleavings of small, correctness-critical code (e.g., custom concurrency primitives such every bit sharded reader-writer locks), in addition to Shuttle to randomly cheque interleavings of larger assay harnesses that Loom cannot scale to (e.g., end-to-end stress tests of the entire ShardStore stack).
Experience
Continuous validation was a priority for the project. The organization evolved over time, in addition to is expected to evolve further. It was of import for the automated reasoning to back upwards this in addition to supply continuous validation, beingness maintained amongst code yesteryear the developers who wrote the code.
We focused heavily on lowering the marginal toll of hereafter validation: nosotros would non direct hold considered this piece of employment successful if hereafter code changes yesteryear engineers required kicking off novel formal methods engagements. Early inward our work, nosotros wrote reference models using modeling languages nosotros were familiar with (Alloy, SPIN, in addition to Yggdrasil-style Python) in addition to imagined developing tooling to cheque the Rust code against them. It was only when nosotros discussed long-term maintenance implications with the squad that nosotros realized writing the models themselves inward Rust was a much amend choice, in addition to fifty-fifty afterward when nosotros realized the reference models could serve double duty every bit mocks for unit of measurement testing.
0 Response to "Using Lightweight Formal Methods To Validate A Key-Value Storage Node Inward Amazon S3 (Sosp21)"
Post a Comment