PhD Defense: Quantifying Flakiness and Minimizing its Effects on Software Testing
In software testing, test inputs are passed into a system under test (SUT); the SUT is executed; and a test oracle checks the outputs against expected values. There are cases when the same test case is executed on the same code of the SUT multiple times, and it passes or fails during different runs. This is the test flakiness problem and such test cases are called flaky tests.The test flakiness problem makes test results and testing techniques unreliable. Flaky tests may be mistakingly labeled as failed, and this will increase not only the number of reported bugs testers need to check, but also the chance to miss real faults. The test flakiness problem is gaining more attention in modern software testing practice where complex interactions are involved in test execution, and this raises several new challenges: What metrics should be used to measure the flakiness of a test case? What are the factors that cause or impact flakiness? And how can the effects of flakiness be reduced or minimized?This research develops a systematic approach to quantitively analyze and minimize the effects of flakiness. This research makes three major contributions. First, a novel entropy-based metric is introduced to quantify the flakiness of different layers of test outputs (such as code coverage, invariants, and GUI state). Second, the impact of a common set of factors on test results in system interactive testing is examined. Last, a new flake filter is introduced to minimize the impact of flakiness by filtering out flaky tests (and test assertions) while retaining bug-revealing ones.Two empirical studies on five open source applications evaluate the new entropy measure, study the causes of flakiness, and evaluate the usefulness of the flake filter. In particular, the first study empirically analyzes the impact of factors including the system platform, Java version, application initial state and tool harness configurations. The results show a large impact on SUTs when these factors were uncontrolled, with as many as 184 lines of code coverage differing between runs of the same test cases, and up to 96% false positives with respect to fault detection. The second study evaluates the effectiveness of the flake filter on the SUTs' real faults. The results show that 3.83% of flaky assertions can impact 88.59% of test cases, and it is possible to automatically obtain a flake filter that, in some cases, completely eliminates flakiness without comprising fault-detection ability.
Chair: Dr. Atif Menon Dean's rep: Dr. Gang Qu Members: Dr. Ashok Agrawala Dr. James Purtilo Dr. Alan Sussman