When distributed systems fail, it’s usually because of one of the following reasons:
- Concurrency bugs
- Inadequate failure handling
- Incorrect input validation (usually leading to security bugs) or flawed threat models
Distributed systems typically don’t have global locking, so concurrent modification of state is more likely to occur. Distributed systems are usually not “fail-stop” so hardware failures and network partitions must be tolerated. Both of these are similar to problems that occur in non-distributed systems too, and the techniques for test and validation are similar.
Model Checking and Proofs
It pains me to say this, but human-written systems proofs are basically useless. They’re almost always wrong. I cover some cases where this has occurred in my talk “Never Trust Any Published Algorithm”:
However, we can test for correctness of the system design if we can model it and identify its invariants. Languages and systems for doing so include:
- Alloy, a relational specification language and model checker. It was used by Pamela Zave to find bugs in the Chord distributed system: How to Make Chord Correct
- TLA+, a specification language and model checker by Leslie Lamport. Amazon used it to find bugs in some of their web services: Use of Formal Methods at Amazon Web Services
- Spin / Promela: an older generation of model checker focused on concurrency.
- The Coq Proof Assistant, a theorem checker that can be used for programming problems
- The KeY Project, which augments Java code with invariants and prove their correctness. Found a bug in Timsort: Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it), though I don’t know whether it has successfully been used in a distributed context
A limitation of some of these systems is that they do not work directly with the implementation code (though Key and Spin can do so.) So the model may not accurately reflect bugs in the implementation. The model checkers can find cases where the invariants fail, but not prove correctness. Fortunately (?) most real systems have bugs that can be expressed in terms of small use cases.
Failure Injection and Fuzzing
To test the real implementation, it’s necessary to define what its correct behavior looks like, and then try to break that behavior. So again we need some invariant, like linearizability in a database, or system availability in a web service, or “not crashing”. Then we can “throw a lot of spaghetti at the wall and see what sticks” by exploring the behavior of the system when we inject different types of inputs or failures. Chaos Engineering tries to formalize this into a set of practices around observability and testing in production.
- Netflix/SimianArmy is a suite of tools that induce failures or check for unusual behavior in a distributed system
- Jepsen is a library for configuring and applying load to a distributed system, inducing failures and network partitions, and looking for inconsistencies in the results. It has been quite successful at demonstrating failures in distributed databases that violate the guarantees made in the documentation.
- Fault injection libraries like libfiu wrap system calls and cause functions to return errors in predictable, repeatable ways.
- The code itself can have instrumentation like “crash points” which deliberately crash the system in critical regions. Or “delay points” can be enabled which increases the time code spends performing an operation, increasing the chance of seeing a race condition occur.
- Tools like netem or Linux Traffic Control can be used to artificially induce network latency, packet loss, or duplication. (There are also various benchmarking systems which can be used to generate large amounts of HTTP requests against a test system.)
Test systems which work mainly on single-node code can still be useful for finding classes of error common in networked systems, such as security vulnerabilities.
- libFuzzer and american fuzzy lop are tools for discovering test cases that get good code coverage, thereby exploring corner cases such as assertions and buffer overruns.
- Manticore and other symbolic execution engines build a “virtual machine” on which the program executes that has additional capabilities to help determine correctness. These systems try to provide some of the benefits of model checking while running on the real code.
- Hypothesis or QuickCheck perform “property-based testing” by trying a lot of different test cases, then finding a minimal example which violates the conditions you specified.
Monitoring and Redundancy
In addition to testing in the lab, most large distributed systems are tested in production as well. Monitoring and tracing capabilities let the system’s designers and operators understand if it is working as expected. Redundant design ensures that a request gets answered even if a software bug or component failure prevents the original server from answering.
- Systems like Honeycomb can assemble measurements taken in a distributed system into a larger picture of system health, and highlight unusual activity.
- Rolling out a new version of the software can be done in an incremental fashion, so that the service remains up while the new code is exercised by real users.
All of the above
Robust development tends to use several of these approaches, as well as human-centered ones such as code and design and test plan reviews. These tools and techniques give multiple overlapping ways to find error. For example, concurrency bugs can be found by:
- a model checker finding a problematic sequence of operations violating the model constraints
- a symbolic execution engine detecting a potential race when the code is run
- a property-based tester exploring different interleaving of operations
- a test harness exposing the inconsistency by injecting delay and failure
- observing the system in production and highlighting inconsistent behavior
All of these techniques start with some definition— even if it’s “don’t crash” —of what the correct behavior actually is. That’s often the trickiest part in any testing effort; as the system becomes more complex and gets more features, the interactions between them can be harder to describe precisely.
Originally answered on Quora: https://www.quora.com/What-approach-would-we-take-to-test-or-validate-a-distributed-system/answer/Mark-Gritter
The postings on this site are my own and do not represent DataDirect Network's positions, strategies, or opinions.
Very comprehensive review of existing methods. Thank you!
Posted using SteemSigma
Sounds like a question for StemQ! 😉
This post has been voted on by the SteemSTEM curation team and voting trail in collaboration with @curie.
If you appreciate the work we are doing then consider voting both projects for witness by selecting stem.witness and curie!
For additional information please join us on the SteemSTEM discord and to get to know the rest of the community!