Fundamentals of Software Testing

Fall 2009

Course Number: CMSC737.

Meeting Times: Tue. Thu. - 9:30AM - 10:45AM (CSIC 2107)

Office Hours: Tue. Thu. - 10:45AM - 12:00PM (4115 A. V. Williams Building)

Catalog Course Description: This course will examine fundamental software testing and related program analysis techniques. In particular, the important phases of testing will be reviewed, emphasizing the significance of each phase when testing different types of software. The course will also include concepts such as test generation, test oracles, test coverage, regression testing, mutation testing, program analysis (e.g., program-flow and data-flow analysis), and test prioritization.

Course Summary: This course will examine fundamental software testing and program analysis techniques. In particular, the important phases of testing will be reviewed, emphasizing the significance of each phase when testing different types of software. Students will learn the state of the art in testing technology for object-oriented, component-based, concurrent, distributed, graphical-user interface, and web software. In addition, closely related concepts such as mutation testing and program analysis (e.g., program-flow and data-flow analysis) will also be studied. Emerging concepts such as test-case prioritization and their impact on testing will be examined. Students will gain hands-on testing/analysis experience via a multi-phase course project. By the end of this course, students should be familiar with the state-of-the-art in software testing. Students should also be aware of the major open research problems in testing.

The grade of the course will be determined as follows: 25% mid-term, 25% final exam, 50% project.

Credits: 3

Prerequisites: Software engineering CMSC435 or equivalent.

Status with respect to graduate program: MS qualifying course (Midterm+Final exam), PhD core (Software Engineering).

Syllabus: The following topics will be discussed (this list is tentative and subject to change).

Introduction to software testing

[Sep. 1, 3, 8, 10, 15] Contents: The need for testing; testing as an integral part of software engineering; software engineering processes and testing.
Slides: 1.pdf, 2.pdf
Reading List

Testing: a roadmap, Mary Jean Harrold, Proceedings of the conference on the future of Software engineering May 2000.
Introduction to special section on software testing, R. Hamlet, Communications of the ACM June 1988, Volume 31 Issue 6.
Testing: principles and practice, Stephen R. Schach, ACM Computing Surveys, (CSUR) March 1996, Volume 28 Issue 1.
Software safety: why, what, and how, Nancy G. Leveson, ACM Computing Surveys (CSUR) June 1986, Volume 18 Issue 2.
Validation, Verification, and Testing of Computer Software, W. Richards Adrion, Martha A. Branstad, John C. Cherniavsky, ACM Computing Surveys (CSUR) June 1982, Volume 14 Issue 2.

Test-case Generation

Contents: Blackbox testing; sampling the program's input space, white-box testing; path-testing; branch and predicate testing, GUI Testing.
Reading List

[Sep. 17] The category-partition method for specifying and generating functional tests, T. J. Ostrand, M. J. Balcer, Communications of the ACM June 1988, Volume 31 Issue 6. Slides: 3.pdf
[Sep. 22] A test generation strategy for pair-wise testing, Kuo-Chung Tai; Yu Lei, Software Engineering, IEEE Transactions on, Volume: 28 Issue: 1, Jan. 2002, Page(s): 109 -111. Slides: 4.pdf
[Sep. 24] Predicate-based test generation for computer programs, Kuo-Chung Tai, Software Engineering, 1993. Proceedings of the 15th International Conference on, 1993, Page(s): 267 -276. Slides: 5.pdf
[Sep. 24] A heuristic approach for test case generation, Kai-Hsiung Chang, W. Homer Carlisle, James H. Cross, II, David B. Brown, Proceedings of the 19th annual conference on Computer Science, 1991, Page(s): 174–180. Slides: 6.pdf
[Sep. 29] Hierarchical GUI Test Case Generation Using Automated Planning, by Atif M. Memon, Martha E. Pollack, and Mary Lou Soffa., IEEE Trans. Softw. Eng., vol. 27, no. 2, 2001, pp. 144-155, IEEE Press.
[Oct. 1] Student presentation by Samuel Huang. Automatic Requirements Extraction from Test Cases. Slides: Huang.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell: In this work (and in any further work) where was importance placed with regards to the rate of false positives versus false negatives? Which one is more detrimental to the overall results? In other words, which was (and will be) focused on to reduce it's rate in the work (and future work)?

Answer Outline: We would like to claim a fairly ``correct'' system (so low false positive rates), which can be attributed in part to the validation through monitor models, as well as the iterative process in which we ultimately mine for the association rules. Furthermore, because of the structurally-guided test generation plus iterations, we would like to also claim a fairly ``complete'' system as well (so also low false negative rate). The relative importance placed on FP vs. FN rates is not really a concern for us; what can be said is that we trust the correctness for any given number of iterations of our approach (which has shown to have different FN rates), but the completeness comes through these iterations of structurally-guided tests.

2. Amanda Crowell: Briefly describe some of the drawbacks of using models and how Huang et al. addressed the drawbacks in their research.

3. Christopher Hayden: While there seems to be a good deal on information regarding extraction of requirements from test cases, there doesn't seem to be much discussion surrounding the initial generation of the test cases themselves. How were the test cases used in your experiment generated? More generally, is there any relationship between the criteria used to generate test cases and the ability of those test cases to reveal invariants? For example, if I generate one test suite using data flow criteria, and I generate a second test suite using branch coverage criteria, how do those test suites compare in their ability to reveal invariants?

4. Rajiv Jain: I believe that you used two single variables for associations in your research. If A and B showed an association and B and C showed an association, then did it hold that A and C held an association in all cases. Was this caught through your test model?

Answer Outline: The situation where we find rules A -> B, as well as B -> C does not come up in our work. Because our work falls under what is considered ``class association rule mining,'' the consequent of any given association rule is of some type, and we never have two predicates in a given association rule of the same type. For this work, the right hand side is always the new state that we transition to, and we do not consider rules that have ``new state'' information on the left hand side. Transitivity of rules is therefore not an issue for us.

5. Shivsubramani Krishnamoorthy

6. Srividya Ramaswamy

7. Bryan Robbins: One of the primary decision points of the presented requirements abstraction framework is the decision of how to generate test cases. The authors explore one effect of this decision by comparing the similarity of invariant sets generated from test case sets that provide partial or full coverage, respectively. Suppose we define a measure of "% covered" for a given set of random test cases. How would we expect %covered to correlate with the similarity of resulting invariant sets, given the experimental findings of this study? Would this relationship capture all of the authors' findings regarding invariant set similarity, or only part?

Answer Outline: We have shown that for values of %covered approaching 1, the similarities are also close to 1 for our structural guided tests, while for %covered values lower (using random generation), the similarities of the invariant sets are lower. This is the expected behavior, however for a given run we may find sets that are very similar. Obviously, as the accuracy of the invariant sets approaches 1 (with a low FP rate), the similarities will all be close to 1 as well. As the accuracy is lower, the variation increases, and because %covered could entail a range of accuracies, it would not be surprising to see variation for %covered values not near 1.

8. Praveen Vaddadi: How is strength of an association rule related to its support and confidence?

9. Subodh Bakshi

10. Samuel Huang: None

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae: Does one have to consider independence/interference between recovered invariants before aggregating all invariants not found to not be invalid over any experiment as true invariants?

14. Hu, Yuening: Your experiment is done 5 times. What is the difference among them? Is a current one based on the previous one, or each of them is done independently and is just the repeat of the previous one? If so, why do you use Jaccard Similarity for more analysis?

15. Liu, Ran: How does one determine if any association rules are invariants or not?

Answer Outline: We use the term ``invariants'' only at the end of the entire iterative process of repeated test case generation, mining, and validation (repeated). At any given iteration, we are confident of the most recently mined association rules after the validation step. This set of invariants is then further refined and augmented in future iterations. Following the last validation step performed, we conclude that this set of association rules comprises of the ``invariants.''

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep: In terms of coming up of test cases to discover invariants, was length of the test cases considered? Do you think longer test cases will discover more invariants?

Answer Outline: Rather than focusing on test case length, we focused on test suite coverage. We felt that this was more indicative of exploring the state space of the design model. However, one would expect that with longer test cases, more of the state space would be uncovered as well. We do show that we get almost complete coverage with our iterative approach using structural coverage.

18. Guerra Gomez, John Alexis: This paper defines an invariant to be a characteristic that is always true for all tests executed. Consider a less strict definition of an invariant, i.e., a characteristic that is mostly true (e.g., 90% or more of the time)? What would be the advantages of such a modified definition? And disadvantages?

[Oct. 6] Using a Pilot Study to Derive a GUI Model for Automated Testing, by Qing Xie and Atif M. Memon, ACM Trans. on Softw. Eng. and Method., 2008, ACM Press.
[Oct. 8] Using GUI Run-Time State as Feedback to Generate Test Cases, by Xun Yuan and Atif M. Memon. In ICSE '07: Proceedings of the 29th International Conference on Software Engineering, May 23-25, 2007, pp. 396-405. (its extended version: Generating Event Sequence-Based Test Cases Using GUI Run-Time State Feedback, accepted for publication to the IEEE Transactions on Software Engineering).

Experimentation and empirical studies in testing: [Oct. 13]
Midterm exam: [Oct. 15]
Test Adequacy

Contents: Test coverage and adequacy.
Reading List

[Oct. 20] General discussion and overview. Slides: 7.pdf
[Oct. 20] Coverage criteria for GUI testing, by Atif M. Memon, Mary Lou Soffa, and Martha E. Pollack, ESEC/FSE-9: Proceedings of the 8th European software engineering conference held jointly with 9th ACM SIGSOFT international symposium on Foundations of software engineering, (New York, NY, USA), 2001, pp. 256-267.
[Oct. 20] Student presentation by John Alexis Guerra Gomez. Testing Python Applications – a case study. Slides: testing_python_applications.pdf

Questions posed by students:

1. Nathaniel Crowell: Generally, how has the selection of python for software development affected the project?? Specifically, how have known flaws in various versions of python (e.g., memory leaks) guided your development and testing process?

2. Amanda Crowell: What kind of testing was done to verify that the signals being sent to the device were actually causing the correct vibrations on the device?

3. Christopher Hayden: In your presentation, you noted that you are currently working on implementing what appear to be large system tests and that you plan to implement GUI testing at some point in the future. However, I didn't see anything about small system tests or unit tests. Are these something you plan on doing in the future as well? If so, will you be using a particular unit testing framework to generate the tests, execute the tests, and recover the metrics?

4. Rajiv Jain: What makes testing python different than any other programming language such as C/C++ or Java?

5. Shivsubramani Krishnamoorthy: You displayed the code coverage with one of the input .png image. So, am I right in assuming that the coverage is based on the input image? If you provide a different image, will the coverage differ? I am not able to imagine how the coverage will differ that way. It could be because I don't know the modules in your source code. Can you throw light on it?

6. Srividya Ramaswamy: Were the techniques you used to generate test cases for your large system tests similar to the ones we have seen in class? (You mentioned that your program uses command line operations after receiving an image to process. That immediately made me think of 'Category Partition Method' as the example we saw for that method was a Unix command). If you didn't use any of the techniques we have seen in class, could you briefly describe the technique you used?

7. Bryan Robbins: As others have mentioned, you focused here on Large System Tests. Interpreted languages have a pretty nice quality in that they can be explored interactively using the language's interpreter. Have you considered how you could leverage some pretty simple scripting in order to perform your future testing at a lower level?

8. Praveen Vaddadi: Does the bytecode of a python program (*.pyc) contain information about statement coverage? If not, how do we get the line coverage from bytecode compilation?

9. Subodh Bakshi: You said that you have test cases generated by the program receiving images to process operations using command line and then sending the codified image to the default IRIS device. You also said that you have a dummy device as the oracle. But since, you do not do a diff, how exactly is reference testing carried out?

10. Samuel Huang: Your work seems to step into many different domains, notably touching upon challenges from both Computer Vision and HCI. Given that you are seeking to avoid increasing the sophistication of the hardware (to prevent the cost of building a unit from raising further), perhaps you could draw upon some of the existing approaches from these other fields to improve the software's sophistication, especially at the user interface level. Even if additional controls are not possible for the device currently, having a demo that simulates what would happen if buttons were pressed may be of interest to potential investors. Does this seem feasible?

11. Ghanashyam Prabhakar

12. Jonathan Turpie: The images sent to the IRIS device are very pixilated. Do you think it anti-aliasing would work for the vibration images the same way it does for vision-based images?

13. Cho, Hyoungtae:

14. Hu, Yuening: Do you get the line coverage report based on "*.pyc" files? For java and Emma, we need to compile the source code with "debug=on" and then instrument, then the line coverage report can be obtained. Do you need to the same thing?

15. Liu, Ran: You mentioned that there is a dummy to capture the codified image output and keep the output in a log. What will you do after that? Will you do reference testing by comparing the output to the test oracle or any other technique to see if there are any faults?

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Regression Testing I

Contents: Test maintenance.
Reading List

[Oct. 22] Scott McMaster, Atif Memon, Call-Stack Coverage for GUI Test Suite Reduction, IEEE Transactions on Software Engineering, vol. 34, no. 1, pp. 99-115, Jan., 2008.

Usability Testing

[Oct. 27] Student presentation by Hyoungtae Cho. Usability testing: what have we overlooked? Slides: Hyoungtae.pdf

Questions posed by students:

1. Nathaniel Crowell: Based on the cited example that "3 test sets with 5 users" produces better coverage than either "1 test set with 15 users" or "15 test sets with 1 user", could the reason for this just be that what is really important is diversity along these two of the test axes rather than some "magic number" of 5 users?

2. Amanda Crowell: Both in the paper and in your presentation, it was mentioned that participant recruitment plays a big role in the results of usability testing.? Did the researchers offer any suggestion as to what effective recruiting methods should be?? What is your opinion on how one should recruit usability testers?

3. Christopher Hayden: The paper treats the number of users and the number of user tasks as independent of one another in its statistical calculations. It seems that this might be suboptimal in the sense that clearly both are important components in usability testing. Would it make more sense to model these two quantities as related to one another, say through a higher dimensional analysis that considers both quantities simultaneously? For instance, one could create a three dimensional plot where the axes are number of users, number of user tasks, and number of problems found and perform similar statistical analysis to that performed in the paper. Are you aware of any research that performs this type of higher dimensional analysis with regard to usability testing?

4. Rajiv Jain: What makes something a usability problem? According to the paper, it seems usability problems have something to do with the amount of time a task takes. However, if I were to use that measurement, a useable device such as a cool PHONE would be full of problems given the time it would take my parents to use it.

5. Shivsubramani Krishnamoorthy: From the paper and from other sources, I understand that1-(1-probability of finding a fault)number of users assumes that the faults are equally detectable. It considers only the number of users but not various critical variables like the type of the application, size of application, type of test user and their experience etc. These may affect the context, which may have various critical problems/faults associated with it. Will it be true that once these variables are incorporated into the equation may anyway disprove Jacob Neilsen's claim (especially for software usability testing).

6. Srividya Ramaswamy: The paper concludes saying that the role of number of users in a usability test is not as significant as the important role of task coverage. However, I think that these two factors are related. Could you comment on that?

7. Bryan Robbins: A number of the authors' conclusions and suggestions for future research in the area involve improving the effectiveness of both usability testers and user tasks. A large body of work on the topic of finding problems in requirements and design artifacts suggests that individual effectiveness often dominates performance. Given the authors findings, would you expect the same to be true for usability testing? How could we fairly test this hypothesis?

8. Praveen Vaddadi: The paper claims that user task coverage is more important than number of participants in predicting proportion of problems detected. However, do you think it misses out an important factor of considering the type of the participant?

9. Subodh Bakshi: Is there a way to define the number of user tasks to improve the proportions of problems?

10. Samuel Huang: The model used to address the likelihood of discovering a fault/problem is using a ``noisy-or'' like approach which assumes independence between faults. Given that this is probably not always true, can we find or otherwise determine more structured relationships between faults? Can we then capitalize on this structure to better predict certain ``types'' of faults?

11. Ghanashyam Prabhakar

12. Jonathan Turpie: With the Found(i) = N(1-(1-λ)^i), one of the concerns raise in class was that λ is not constant for all faults. Instead of a constant, maybe substituting λ with a random variable that looks like the distributions of detection rates (how many users detect a particular fault) for faults. Do you think that there is a common distribution for λ that can be used to generate a new 5 users like rule, or do you think λ varies too much from program to program?

13. Cho, Hyoungtae:

14. Hu, Yuening: What is the influence of this paper? Is it successful in changing people's focus from uses' number to users' tasks?

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Software Testing Effectiveness

[Oct. 27] Student presentation by Nathaniel Crowell. Chen, T. Y. and Merkel, R. An upper bound on software testing effectiveness. ACM Trans. Softw. Eng. Methodol. 17, 3 (Jun. 2008), 1-27. Slides: Nathaniel.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell

2. Amanda Crowell: Do you agree with the assumption that only the first fault found is of immediate concern?

Answer Outline: The answer to this depends on the goal of testing. I agree that for actual software development and debugging that any faults found after the first could just be as a result of the first rather than some additional fault in the software.

3. Christopher Hayden: The paper discusses the fact that adaptive random testing achieves an F-measure that is very close to optimal. However, no other testing techniques are explicitly discussed. It seems that test suites developed using various criteria, such as line coverage or branch coverage, should correlate in some way with input space coverage and should therefore have empirically measurable F-measures. How do test suites developed using such criteria compare with adaptive random testing in their ability to approach the optimal F-measure?

Answer Outline:

4. Rajiv Jain: My understanding is that the input domain can often be infinite and multi-dimensional. How would this algorithm apply in these cases?

Answer Outline: In these cases, it might not be possible to explore the input domain using no knowledge of the software under test. Using white-box techniques it could be possible to try and determine the extremes of the input domain without having to do a full analysis of the code to determine exactly all of the critical values that produce different coverage.

5. Shivsubramani Krishnamoorthy: Is F-measure the right criteria to measure the effectiveness of a test strategy? There are possibilities that for a strategy the first fault may be revealed late but many faults could be revealed in quick succession after that. Another strategy may find the first bug with lesser number of test cases, but may find just a few bugs altogether. So will it be better to consider the f-measure for `n' faults, instead of just one (first) fault - (i.e. number of test cases to reveal first `n' number of faults).

Answer Outline: Research exists to suggest that for real faults, they tend to be clustered in the input domain. Given this knowledge, it seems likely that a test case that reveals one fault will reveal others. Or that given some test case that has been found to reveal one fault, other faults can be found by only making small changes in the input space. These reasons support the F-measure as a useful criterion of effectiveness.

6. Srividya Ramaswamy: Currently, we are not aware of any algorithm to determine an optimal test pattern for an arbitrary shape. We know that any arbitrary pattern can be geometrically covered by a continuous set of squares (high school Mathematics). Thus, can we find the F-measure of each of those squares and then find a way of approximating that result to the F-measure of the arbitrary pattern?

Answer Outline:

7. Bryan Robbins: The authors define an upper-limit relative to the F-measure of random testing with replacement. Is this a useful reference point for practicing testers? Why or why not?

Answer Outline: Obviously, testing strategies can be implemented that performs worse than the upper-limit. I think that the value of this upper-limit is in evaluating testing strategies. For a given strategy that has been examined, what is learned about it can be used by the practicing testers to help them have level of confidence in their testing results.

8. Praveen Vaddadi: The FSCS-ART strategy seems to work fine by reducing the number of test cases necessary to detect first failure by up to 50%. How do you think would this strategy do for higher dimensions, since in a real testing area input domains usually are far from being one- or two-dimensional. I also gather that in higher dimensions, failure patterns seem to be preferably located around the center of the input domain. What do you think would be the best approach with regard to FSCS method to solve this dimensionality problem?

Answer Outline: I think that we saw earlier in the semester that even for applications that seem to have a vast number of combinations of parameters/options, this number can be greatly reduced by examining what combinations are actually allowed by the software. I think that the problem of multi-dimensions input domains is not unique to FSCS-ART and is a problem that affects any testing strategy.

9. Subodh Bakshi: I agree this paper deals with testing effectiveness, but how does defining an exclusion region justify correctness? As in once the exclusion region is defined, why do we (in a way) stop that failure region being overlapped by another test case?

Answer Outline:

10. Samuel Huang: The tessellation discussion seems interesting, but it also seems limited in the types of unbounded regions that can be represented. A quadratic relationship, for example, seems problematic. Is there a class of shapes that we could formally describe to be unrepresentable?

Answer Outline:

11. Ghanashyam Prabhakar

12. Jonathan Turpie: Why do the authors assume the failure region is contiguous? They claim later that all the theorems presented in the paper work when the regions are not contiguous, but this is hard to believe without any evidence.

Answer Outline:

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Random Testing

[Oct. 29] Student presentation by Shivsubramani Krishnamoorthy, On Random Testing (sources: (1) Ciupa, I., Leitner, A., Oriol, M., and Meyer, B. 2007. Experimental assessment of random testing for object-oriented software. In Proceedings of the 2007 international Symposium on Software Testing and Analysis (London, United Kingdom, July 09 - 12, 2007). ISSTA '07. ACM, New York, NY, 84-94. DOI= http://doi.acm.org/10.1145/1273463.1273476, (2) Ciupa, I., Leitner, A., Oriol, M., and Meyer, B. 2008. ARTOO: adaptive random testing for object-oriented software. In Proceedings of the 30th international Conference on Software Engineering (Leipzig, Germany, May 10 - 18, 2008). ICSE '08. ACM, New York, NY, 71-80. DOI= http://doi.acm.org/10.1145/1368088.1368099). Slides: Krishnamoorthy.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell: The authors site that they use the Levenstein distance in determining the distance between string parameters on an object.? Considering how strings are typically used in software, does the use of this measure really make sense? Should any weight be given to it when comparing objects?

2. Amanda Crowell: In the ARTOO paper, the researchers stated that their object distance calculation only takes into account the syntactic form of the objects and does not account for their semantics.? They suggest, as an optimization idea, that semantics should also be considered.? Should this really be an optimization rather than an integral part of ARTOO?

3. Christopher Hayden: In the paper on ARTOO, the authors specifically address the issue of boundary values for primitive types by including a parameter that allows the test generation to probabilistically select from a set of predefined values suspected to be likely to cause errors. However, it is also possible for more complex objects to have boundary "states" likely to produce errors, yet the paper makes no provision for such a possibility. In many cases, these states may be very unlikely to occur through random manipulation of the object and so are unlikely to be reached by ARTOO. Do you believe there is a case to be made for including a predefined pool of objects that can be sampled probabilistically, much like there is for primitive types?

4. Rajiv Jain: Given the Pros and Cons of pseudo-random testing techniques shown in these papers, when would you choose to use random testing as opposed to the standard techniques we have learned in class? When wouldn’t you?

Answer Outline: I personally feel there isn't a clear defined criterion to either go or not go with RT. I believe it is up to the tester to decide based on few aspects. In addition to find this paper mentioning general advantages of RT, I found a paper that mentions that RT would be preferred when there is no clear idea about the properties of the input parameters. The best approach then, in terms of time and effort, would definitely be RT. Another paper “An Evaluation of Pseudo-Random Testing for Detecting Real Defects” concludes that RT is good when the faults are easily detectable. In that case the expected test length is low. But it isn't as simple as that when the faults have low detectability. So, the tester needs to make a decision based on what kind of program/application that he is testing.

5. Shivsubramani Krishnamoorthy

6. Srividya Ramaswamy: For the "full distance definition" the values of all the constants are taken as 1 except for alpha and R which are 1/2 and 0.1 respectively. On what basis are the values for these constants determined?

Answer Outline: The authors have used 1 as the values for most of the constants, may be, because they did not want to bias any of the components in the equation. But, probably, 'alpha' is given a value of 0.5 to bring down the effect of the component in the equation in case of large number of fields in the object. The authors mention in the previous paragraph about averaging in the field distance value to handle situations where objects have large number of fields.

7. Bryan Robbins: In the 2007 paper, the authors fit the curve of bugs found over time to a function f(x) = a/x + b. This curve is potentially useful in informing a timeout for random test execution, but the values of a and b vary widely depending on the application under test. What would a tester have to do in order to utilize the curve as a model?

Answer Outline: The parameters a and b definitely would vary based on the size of nloc and similar attributes. But I believe they may not impact the process here because though they vary with different programs, they remain constant within a program. And the in this process of finding fault rate, we would focus just on one particular program which would keep a and b constant throughout. Thus the focus would be on analyzing the values of f(x) for a series of values of x (x, x-1, x-2... x-n). The values of a and b can play role in deciding upon the value of n here.

8. Praveen Vaddadi: The criteria the paper (ARTOO: adaptive random testing for object-oriented software) used were the F-measure and the time required to reveal the first fault to compare the RT and ART techniques. Do you think this is the right metric to use because there is a good chance that either of the techniques can detect a bunch of faults even if they find them late? I feel a metric that uses the number of faults detected in a stipulated amount of time could throw a better light in comparing the two techniques.

Answer Outline: It is quite intuitive to ask this question. I had the same question while reading the paper. I believe one simple reason why the authors chose this criterion could be because it is a widely accepted criterion and is used in many papers stating software-testing experiments. But, I feel a better criteria should be to measure the time/test cases required to reveal 'n' number of faults (instead of just 1). In that case, the problem you stated would be handled better.

9. Subodh Bakshi

10. Samuel Huang: One estimate given for running a particular series of experiments assumes we brute force search over all combinations of input parameters, which blows up exponentially in the input size. Could we use some sort of a guided search method automatically to reduce this? As an example, maybe we could perform some sort of a guided random walk through the parameter space?

Answer Outline: The brute force approach is not the best one for sure. But, the basic idea of ART itself filters lot many inputs (though the selection process is time consuming). Another approach the authors specify for another reason is clustering the inputs. This could also be a good approach to filter the inputs, such that an (or specified few) input selected from a cluster automatically filters other inputs. Thus, you avoid searching through a major part of the input space.

11. Ghanashyam Prabhakar

12. Jonathan Turpie: How do you deal with complex method contracts (preconditions)? For example there are two input arrays that must have exactly the same length. There may be something wrong with another class that would cause the “method under test” to be run on invalid input.

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis: They proposed to use ARTOO when the number of bugs found by RT starts not growing so fast, but doesn't all the finding bugs techniques have a common growing rate that starts decrementing with the time (when there are not to many more bugs to be found)?

Answer Outline: Through the experiments, the authors showed that fewer test cases would be required to find a bug using ARTOO since the input objects are more evenly distributed. You are right in pointing out that the authors haven't proved that ARTOO's rate of finding the faults does not decrease over time. But, we need to understand that the usual RT performance decreases over time since it exhausts all the possible faults in the narrow area it covers. So the authors suggest that try to find all these easily findable faults using RT (which it does faster) and then use ART to find the faults distributed elsewhere with minimum cost (in terms of required input objects). The rate of finding faults (which implies how fast you find faults) has anyway gone down (with RT), so now focus on finding faults with minimum object inputs.

Database Testing I

[Oct. 29] Student presentation by Praveen Vaddadi, Chays, D., Deng, Y., Frankl, P. G., Dan, S., Vokolos, F. I., and Weyuker, E. J. 2004. An AGENDA for testing relational database applications: Softw. Test. Verif. Reliab. 14, 1 (Mar. 2004), 17-44. DOI= http://dx.doi.org/10.1002/stvr.v14:1. Slides: Praveen.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell

2. Amanda Crowell: The paper states that there are disadvantages, discussed in another paper, to using live data to test databases but doesn't list exactly what these are. It seems, however, that there are also some advantages to using live data that were not discussed at all. Wouldn't the AGENDA method be a good way to actively monitor your database to discover problems as/if they occur?

Answer Outline: The paper alludes to another previous paper in this line to discuss various disadvantages of using live data. The live data may not reflect a sufficiently wide variety of possible situations that could occur. That is, testing would be limited to situations that could occur given the current DB state. Even if live data encompasses a rich variety of interesting situations, it might be difficult to find them, especially in a large DB, and difficult to identify appropriate user inputs to exercise them and to determine appropriate outputs. Also, testing with data that is currently in use may be dangerous since running tests may corrupt the DB so that it no longer accurately reflects the real world. Finally, there may be security and privacy constraints that prevent the application tester from seeing the live data.

3. Christopher Hayden: In the paper, the authors use the category-partition method to populate the test database. They dismiss the possibility of random testing, arguing that generation of valid inputs is likely to be inefficient and that the inputs should be specifically selected so as to maximize the likelihood of errors. There are problems with both of these first arguments. First, constraints could be placed on the random generation of data such that only valid inputs are produced. Second, several empirical studies have shown, surprisingly, that the category-partition method is generally inferior to random testing. Third, the process of generating random data can be augmented so as to include values that are deemed likely to cause errors are included. In light of these facts, can you provide a further explanation of the disadvantages of applying random testing to databases, or do you believe that the authors may have dismissed the possibility too lightly?

Answer Outline:

4. Rajiv Jain: There seems to be no empirical evidence presented in this paper about Agenda’s fault detection effectiveness, so why should we use this tool to test database applications?

Answer Outline: I agree that the paper describing the AGENDA framework did not discuss its effectiveness and performance issues in detail. However, the same authors have written a second paper on AGENDA (http://portal.acm.org/citation.cfm?id=1062455.1062486) which mainly talks about how various issues involving the limitations of the initial AGENDA are resolved. In the end, they present an empirical evaluation of the new improved AGENDA. They ran it on transactions from three database applications with seeded faults and report that about 52% of the faults were detected by AGENDA (i.e. 35 out of 67 faults). But this empirical study too has its own limitations and threats to validity like the faults were seeded by graduate students who were well versed with AGENDA. But nevertheless, a study on its effectiveness and fault detection abilities has been reported in the subsequent paper.

5. Shivsubramani Krishnamoorthy: The authors, in the paper, describe about the data generation for the database. I was wondering why the already existing data was not made use of? I don't reckon the authors specifying anything about it. Why was the trouble taken in generating the data, rather than using the live data?

Answer Outline: There could be many issues with using live data for testing database applications. Live data may not reflect sufficiently wide variety of situations and may be difficult to find the situations of interest. Also, live data may violate some privacy and security constraints as required by the database system.

6. Srividya Ramaswamy: The paper mentions that the AGENDA prototype is limited to applications consisting of a single SQL query. Could you discuss a few changes that would have to be made if the prototype were to be extended to applications consisting of multiple queries?

Answer Outline: Extending the system to handle multiple queries would most probably have the same underlying framework with subtle differences. For example, one potential problem with having multiple queries (and in turn having multiple host variables h1, h2, h3,..., hn) is that instantiating host variables independently will create queries whose WHERE clauses will not be satisfied by any tuples in the database. We would like most of the test cases to cause the WHERE clauses in the queries to be satisfied for some tuples in the database, exemplifying more typical application behaviour. Similar problem occur with UPDATE, INSERT and DELETE queries. This issue can be handled in the extended AGENDA by distinguishing between type A (which causes WHERE clause to be true for some tuples) and type B (others can be designated type B) test cases. Heuristics can guide the extended AGENDA to select type A test cases by taking account of dependence between different host variables that are instantiated in a test case. In a similar way, the extended AGENDA can have transition consistency checks and integrity constraint checks, which are a high scale version of the proposed AGENDA framework handling multiple tables and queries.

7. Bryan Robbins: One difficulty the authors encountered was the inability to fully describe the state of the database. They addressed this by having developers provide preconditions and postconditions, but this introduces more work and potentially more faults. Could an approach like GUITAR's Ripper be applied to the database to capture the runtime DB state? What are the critical differences between DB state and GUI state?

Answer Outline: AGENDA prompts the tester to input the database consistency constraints in the form of post- and pre-conditions, which are typically boolean SQL expressions. These boolean expressions are translated into constraints that are automatically enforced by the DBMS, by creating temporary tables to deal with joining relevant attributes from different tables and to replace calls to aggregation functions by single attributes representing the aggregate returned. It can also be argued that GUI testing is a "type" of database testing, wherein, if we consider the case of Event Flow Graphs in GUITAR, it can be represented as a database table with relevant constraints and pointer functionalities with each tuple being a mapping from event to event. Thus, apart from the state consistency checks in a database system, I don't really think that GUI testing is different from database testing.

8. Praveen Vaddadi

9. Subodh Bakshi

10. Samuel Huang: The fact that the state generation starts with the tester's sample-values files seems like an interesting concept. Perhaps something more formal could be done that explicitly adjusts the probability mass of inserting specific tuples, to allow for some theoretical analysis as to what types of tuples will appear in the populated database.

11. Ghanashyam Prabhakar

12. Jonathan Turpie: Previously, for GUI systems we were told that writing operators for everything you wanted to test in a system was too much effort. Why is this any different for a program that uses a database system?

Answer Outline:

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis: They used the contracts and exceptions as oracle, did they consider non-error exceptions as a bug?

Answer Outline: AGENDA's approach to generate and execute test oracles was based on integrity constraints, specifically state constraints and transition constraints. A state constraint is a predicate over database states and is intended to be true in all legitimate database states. A transition constraint is a constraint involving the database states before and after execution of a transaction. Thus there was no need to check for non-error exceptions in this scenario.

Web Testing

[Nov. 3] Student presentation by Rajiv Jain: Sebastian Elbaum, Gregg Rothermel, Srikanth Karre, Marc Fisher II, "Leveraging User-Session Data to Support Web Application Testing," IEEE Transactions on Software Engineering, vol. 31, no. 3, pp. 187-202, March, 2005. Slides: Rajiv.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell: None of the presented technique seems to really get out ahead of the others as clearly better; however, what do you see as the biggest advantage that user session data provides?? Can this advantage be applied to testing in general (assuming costs are not a concern)?

Answer Outline: The basic contribution of User Session Testing is sampling of the input space that approaches or beats the performance of more rigorous techniques in an automated fashion. The approach taken by the authors could be used with testing any system that requires user input, but web servers provide an easy way to observe and capture user interactions with server logs or packet captures that may not be present with other software. It isn’t clear that this technique is required outside of Web or GUI applications because traditional white box testing techniques probably can be performed in an automated fashion with better results.

2. Amanda Crowell: You discussed several issues with the way the researchers performed the experiment? If you could suggest changes they should make to make the experiment better, what would they be?

Answer Outline:

3. Christopher Hayden: In class we discussed in some detail the similarities between web applications and GUI applications. The paper proposed a variety of ways of using user sessions to create test cases for web applications. One way of generating test cases for a GUI application is to have a tester create a session and record it for later playback, but to the best of my knowledge no one has suggested using actual user sessions for GUI testing. Do you feel that there's value in extending this technique to standard GUI applications, or does it rely on web-specific characteristics such as high user volume and a client-server infrastructure? Can you suggest ways by which any such constraints can be overcome? Alternatively, does the lack of attention to this idea in GUI testing suggest anything negative about its perceived effectiveness?

Answer Outline: The ideas behind user session testing can be applied to any software, but the server architecture certainly makes implementation much easier in web applications. There are certainly tools in GUIs already collecting user sessions such as toolbars in web browsers and the Usage Data Collector in eclipse. I would event venture to guess crash reporting tools becoming popular in commercial GUIs are collecting the tool usage immediately prior to a crash for similar to a test case in a capture replay tool. I therefore believe it would not be too much work to extend this idea to the realm of GUIs. Unlike with web servers there is no protocol like http governing interaction with the GUI program so the implementation of collecting user sessions could become application specific limiting its widespread usefulness. I believe the reason that this has not been observed with GUI testing is that a better automated technique already exists. The test cases produced by User Session Testing in GUIs would be a subset of the test cases that could be generated by the GUITAR framework. The next step to User Session Testing would be to develop a system similar to GUITAR for web testing.

4. Rajiv Jain:

5. Shivsubramani Krishnamoorthy: As I had mentioned in the class, all I thought about the web application to be different from a normal application is its real time usage. When reading the paper, I was expecting the authors to stress on real time data. Do you think the data recorded from the user session play role of real time data? I was not able to visualize how. Do you think the experiment is incomplete in this sense?

Answer Outline:

6. Srividya Ramaswamy: The paper mentions that during the Test Suite Creation for the experiment the users were asked to choose two of the four given courses. I am sure that the percentage of users choosing a particular course has a huge impact on the session data generated. However, the paper does not talk about the percentage of students who chose each of the courses. Can you comment on this?

Answer Outline:

7. Bryan Robbins: The authors discuss that the cost of capturing additional user sessions is significant, and suggest a number of filtering and combining techniques for making the most out of the data given. There exist a number of techniques (e.g. Expert systems, cognitive architectures, etc.) that are capable of modeling tasks under constraints similar to those of human cognition. While these would be somewhat imperfect, is there a reason to think that task models would be capable of generating session data, or are there aspects of users that make them irreplaceable?

Answer Outline: Using humans will always represent the subset of realistic input better than any other representation, but close representations could certainly be modeled through smart web crawlers. Both approaches are automated from the tester’s viewpoint so why not use both if User Session test cases are sparse. The better approach between the two is debatable, but the larger issue is whether an emphasis on realistic user-sessions is desired. Either approach would likely result in an oversampling of certain test-cases while entirely missing certain faults. A better approach, if possible, would be to find a way to automatically generate a graph that models the web application similar to Ricca and Tonella. One then could traverse the graph with a number of algorithms to generate more diverse test cases and then traverse all edges or all nodes in addition to using the user sessions.

8. Praveen Vaddadi: Testing the security aspect is critical for web testing. Clearly, this involves testing the security of the infrastructure hosting the web application and testing for vulnerabilities of the web application. Firewalls and port scans can be considered for testing for infrastructure, however how do we test the vulnerabilities of the web application?

Answer Outline:

9. Subodh Bakshi

10. Samuel Huang: The graph representation proposed by Ricca and Tonella for modeling web applications seems appealing. The notation used to describe path expressions on these graphs (such as ``(e1e3 + e4e5) * (e1e2 + e1e3 + e4e5)'') looks reminiscent of Pi-Calculus/CCS as proposed by Milner in the 1980s. In that work, bisimilarity between nodes was used as a notion of equivalence. Could we use the notion of bisimilarity in generating test cases that go beyond independent paths, and also avoid ``bisimilar'' edges?

Answer Outline: I am not sure that is desirable to avoid bisimilar edges in web applications because even small changes in the state of the program can result in execution of different underlying code. For example, while e1e3 and e4e5 could be considered bisimilar, different portions of the Perl code are likely check the two different form variables. It also seems as though the cost savings over linear independent paths would be insufficient to warrant their exclusion.

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening: Seven groups are designed for the experiments, including WB-1, WB-2, US-1, US-2, US-3, HYB-1, HYB-2. Since the test suite size varies, can we simply say that US-3 and WB-2 is better than the others? And I think HYB-1 and HYB-2 should perform better than the others since it is the combination of white box testing and user-session techniques, but the results turned to be not, what do you think maybe the potential reasons?

Answer Outline:

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis: When the authors did the user session slicing and combining did they use any heuristic to choose the mixing point?

Answer Outline:

Regression Testing II

[Nov. 3] An empirical study of regression test selection techniques, Todd L. Graves, Mary Jean Harrold, Jung-Min Kim, Adam Porter, Gregg Rothermel, ACM Transactions on Software Engineering and Methodology (TOSEM) April 2001, Volume 10 Issue 2. Slides: 8.pdf

Defect Estimation

[Nov. 5] Student presentation by Bryan Robbins. Briand, L.C.; El Emam, K.; Freimut, B.G.; Laitenberger, O., "A comprehensive evaluation of capture-recapture models for estimating software defect content," Software Engineering, IEEE Transactions on, vol. 26, no.6, pp.518-540, Jun 2000. Slides: Bryan.pdf

Questions posed by students:

1. Nathaniel Crowell: Generally, how can capture-recapture be applied to GUI testing? How do various components of the system translate to software testing and more specifically to testing of GUIs?

2. Amanda Crowell: Could using virtual inspections have biased the results regarding the appropriate number of inspectors to use?

3. Christopher Hayden: We saw in the paper that all of the estimators underestimate the total number of faults and that they all have a large variance. We also discussed the fact that there are seemingly contradictory results, with one paper recommending the Jackknife estimator and another indicating that the Jackknife estimator does quite poorly. Given that in actual software development there are insufficient resources to perform an experiment to determine the best estimator to apply to a given project, would there be any value in applying all of the estimators and merging their results in some way?

4. Rajiv Jain: Do you think that capture recapture is a more valid measurement of faults then using simpler measures such as the line count or code complexity to compute the ratio of errors in the code?

5. Shivsubramani Krishnamoorthy: It is mentioned in the paper (section 2.2) that defects captured by more than one inspector will usually have a higher probability of detection, which sounds quite obvious. The authors claim that the estimators depend on the order of trapping occasions. Do you think it holds good in inspection context? How do you think does the order of trapping influence the estimators?

6. Srividya Ramaswamy: The paper talks about the models Mh and Mt. In Mh we assume that every defect j has probability pj of being detected, which is the same for every inspector. In Mt every inspector has probability pi of detecting every defect. I understand that the paper deals with documents. However, if these models were applied to Software Testing do you think the estimators using these models should assume anything regarding the code coverage? When two inspectors (testers) use overlapping test cases their detecting capability is bound to be related. Isn't it?

7. Bryan Robbins:

8. Praveen Vaddadi: The paper tries to address the impact of the number of faults on the CR estimators. However, I did not find a direct solution to this issue in the paper. Do you think a consensus can be arrived upon the minimum number of faults that have to be present in a document before the CR estimators can be used using the experiment described in the paper?

9. Subodh Bakshi

10. Samuel Huang: In the discussion section of the paper, the authors comment that ``failure rates for low numbers of inspectors were high, and therefore several models should be used to prevent failure to obtain an estimate.'' Does this mean that the authors want to aggregate different sets of few observers in some meta-sense, or raise the number of inspectors? (or something else?)

11. Ghanashyam Prabhakar

12. Jonathan Turpie: Different inspectors are still likely to find many of the same defects, even if they are not colluding just because they are using similar techniques to look for defects. Do you think that having the inspectors find defects in a document with a known number of defects and filtering out inspectors that overlap too much would help make the estimation of defects in general with the remaining inspectors better?

13. Cho, Hyoungtae:

14. Hu, Yuening: Based on mounts of experiments and analysis, the author recommended using a model taking into account that defects have different probabilities of being detected and the Jackknife estimator. And I noticed that this is a paper published in 2000. So I wonder whether these recommendations have being applied in real practice: is it useful in real industry? Is it accurate?

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Runtime Test Case Generation

[Nov. 10] Student presentation by Amanda Crowell. RUGRAT: Runtime Test Case Generation using Dynamic Compilers. Slides: Amanda.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell: Are there any novel components/techniques implemented by RUGRAT that make it stand out in comparison to other forms of fault injection?

Answer Outline: Based on the related work presented in the paper, I think the main contributions of the RUGRAT method for testing involve its level of automation, level of tester workload, and lack of changes made to code. Compared to the other approaches listed in the paper, RUGRAT seems to be the most automated. Other approaches require the tester to write stub functions or wrappers, thereby increasing the workload of the tester. In RUGRAT, the source code is analyzed and the tester only has to provide the locations he/she wants tested, the types of tests to create, and the oracles. RUGRAT does everything else. Also, other approaches require changes to the source code, which in my opinion means only bad things: why would I want to modify the source that I'm trying to test? So RUGRAT has several advantages compared to the approaches mentioned in the paper, but I'm unsure of its comparison to any approaches that have happened since.

2. Amanda Crowell:

3. Christopher Hayden: The paper mentioned that RUGRAT is capable of identifying sites where indirect function calls are made, but that it is not capable of distinguishing between calls to different functions from that site. In other words, if there are two functions that can possibly be called indirectly from a site, RUGRAT has no way of distinguishing, which one is called. The paper also mentioned that the authors are working on addressing this issue. It seems like it would be very difficult to distinguish between indirected functions at compile time without extensive and complicated static analysis. Do you have any sense for how they are going about accomplishing this?

Answer Outline:

4. Rajiv Jain: RUGRAT seems like it could be a very powerful tool. Where else do you think RUGRAT could be applied outside of resource and function pointer errors?

Answer Outline: The authors mention that they had a previous instantiation of RUGRAT that tested mechanisms that protect against stack "smashing" attacks. Stack attacks are, like resource and function pointer attacks, very difficult to do and simulate. Having something like RUGRAT makes simulating these attacks easier. I think the authors present RUGRAT as a method of test case generation, and are systematically going through various instantiations that approach different targets to demonstrate how the method can be used. They give a (brief) description of how RUGRAT is changed from instantiation to instantiation and list the following as future (possible) uses of RUGRAT:

- Test security policy enforcement techniques that restrict an applications access to certain resources

- Simulate an application that has been taken over by an attacker

- Implemented for java, handling specifics of language differences and exception raising semantics

5. Shivsubramani Krishnamoorthy: The main approach of the paper is using dynamic compilers for the testing. With the example mentioned in the paper, I get the feeling that the situations are more or less some form of exceptions, which need not be simulated dynamically. They can very well be predefined. The author has used the same example of memory availability in couple of his talks too. Do you think that it is a case of selecting a better example or is dynamic compiler really not required?

Answer Outline: I'm not entirely sure what you mean by "predefined", however, I assume you are meaning that the tester can predefine errors to occur in their tests. The reason for using the dynamic compilers is that (1) the tester doesn't have to know ALL the locations that the error can occur and (2) they don't have to make errors for all these locations. The dynamic compiler will locate all the places a certain error can occur and then just automatically create errors, which significantly reduces the effort the tester must put forth as well as reduces any bias the tester might have.

6. Srividya Ramaswamy

7. Bryan Robbins: The inputs used by the authors for code coverage analysis were either "distributed with the program or developed by the authors to provide adequate coverage for our analysis." However, in the case of subject applications from the SPEC and MiBench benchmarks, these "original inputs" were designed for performance evaluation and not test coverage. Is comparing their coverage of EH code to RUGRAT's coverage still valid?

Answer Outline:

8. Praveen Vaddadi: With regard to the "malloc" example in the paper, how different is the simulation of "out-of-memory" by RUGRAT from the buffer overflow scenario caused by the user himself (this could be automated too)? That is, if we use a "gets" function (which stores user's input into a buffer and continues to store until end of line or end of file is encountered) which can take any amount of user input data and thus causing the buffer overflow if the data is too large, will it be different from the buffer overflow simulation (setting "errno" to "ENOMEM")?

Answer Outline:

9. Subodh Bakshi

10. Samuel Huang: The observation made by the authors regarding the ``DynamoRIO dynamic compiler not considering any contextual information'' is interesting. If a tool were available that *did* consider such information, could we hope to do better, perhaps specifically for the pointer protection mechanism discussed? Perhaps some sort of information/data flow analysis could be combined to further improve performance?

Answer Outline: This question is similar to Chris' question, and I decided to address both here. First of all, I think that including contextual information would definitely be more complex and would almost certainly require more intensive static-type analysis. The authors implied that there were dynamic compilers that incorporate contextual information, but I do not know how these work and what effect they would have on RUGRAT functionality. In regards to performance, I think the effectiveness of having contextual information depends on the tester's goals and the types of functions under test. If you were testing functions that operate differently based on the call environment (i.e. the path that led to them being called and state of memory), then contextual information could be very useful and show problems that wouldn't be uncovered with the context. But if the functions under test don't really depend on the environment then the context may not matter. Also, if you add in contextual information, then the time required to analyze results of testing increases greatly (you'd have a result for every time a function was called). So it really depends…

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening: RUGRAT requires no modification to the source or binary program. So for those techniques for generating test cases, what are the disadvantages of those requiring no modification to the source, comparing with those requiring modifications? Will the former ones miss any information in source code?

Answer Outline:

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Fault Injection

[Nov. 12] Student presentation by Jonathan Turpie. G. A. Kanawati, N. A. Kanawati and J. A. Abraham, "FERRARI: A Flexible Software-Based Fault and Error Injection System," IEEE Transactions on Computers, vol. 44, no. 2, February 1995, pp. 248-260. Slides: Jonathan.pdf

Questions posed by students:

1. Nathaniel Crowell: As is the case for most forms of testing, how do we know when enough testing has been done with software fault injection?? Do any of the relevant papers provide some analysis of the costs (monetary and time) for testing with software fault injection?

2. Amanda Crowell: What kinds of processor features are used to inject faults using Xception?? What sort of techniques can be applied for processors without these features?

3. Christopher Hayden: In the presentation, you mentioned the use of hardware emulation coupled with fault injection to simulate hardware errors. Obviously emulation is not perfect or there would be no need to perform fault injection using actual hardware. How well do the results from hardware emulation and actual hardware correlate? If we test software using hardware emulation and fault injection, how certain can we be that the results from that test translate well into the hardware domain?

4. Rajiv Jain: The methods you presented seem to exclusively deal with transient hardware faults, but how does one cope with detecting and handling a permanent hardware fault that propagates through the error checking?

5. Shivsubramani Krishnamoorthy: Research in the field of fault injection has been focusing on the idea of processor independence. FERRARI requires the specification of processors instructions set. The authors of a fault injection tool - FITgrind claim that as a disadvantage of FERRARI. Do you really think that is a big disadvantage, in practical sense? The instruction set for a particular class of applications is going to remain constant. So, is that really an issue?

6. Srividya Ramaswamy

7. Bryan Robbins: We discussed that protocol fault injection may have some utility in the testing of component-based software. Could you describe how we might generate a "Class Not Found" case scenario via protocol fault injection?

8. Praveen Vaddadi: I understand that there are various advantages of hardware fault injection like Heavy-ion radiation method can inject faults into VLSI circuits at locations that are impossible to reach by other means and that experiments can be run in real time, allowing for the possibility of running a large number of fault-injection experiments, etc. What are the potential disadvantages of using hardware fault injection techniques? One disadvantage I could think of is that some hardware fault injection methods, such as state mutation require stopping and restarting the processor to inject a fault, it is not always effective for measuring latencies in the physical system. can you add more to this?

9. Subodh Bakshi

10. Samuel Huang: We could think of a coverage output and other related statistics as a fingerprint of a system. Could we devise a way to generate such a fingerprint a priori for a ``correct'' implementation of the system, and observe the effects such a fingerprint of introducing faults via fault injection? Perhaps if we store this expected ground truth finger print somehow, we could flag when modifications of a code base causes a radical change in program behavior.

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening: I agree that there are a lot of advantages of hardware fault injection, but what is the relation between hardware fault injection and software fault injection? Do you think they are complementary or one maybe replaced by the other?

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Database Testing II

[Nov. 17] Student presentation by Sonia Ng. A framework for testing database applications, David Chays, Saikat Dan, Phyllis G. Frankl, Filippos I. Vokolos, Elaine J. Weber, International Symposium on Software Testing and Analysis archive, Proceedings of the 2000 ACM SIGSOFT international symposium on Software testing and analysis, pages: 147-157 Slides: Sonia.pdf

Questions posed by students:

1. Nathaniel Crowell

2. Amanda Crowell: What are some of the unique issues associated with testing databases as compared to testing other types of software?

3. Christopher Hayden: We've seen a few papers thus far that treat database testing as unique. In each case, the cited reason is that the expected behavior of the database in response to an operation is very sensitive to the database state. However, such state sensitivity is not at all unique to databases; plugin based architectures are commonly used share this feature, the behavior of such an architecture depending on the active plugins at any given time. Considering this, can you draw a distinction, between database testing and testing of any other software that is state sensitive, that justifies the treatment of database testing as a separate topic? Or is it merely a special case of a larger class of software testing?

4. Rajiv Jain:

5. Shivsubramani Krishnamoorthy: The authors have focused mainly on a stand-alone DB system application. In section 1 they have listed a set of questions that need to be addressed. DBs are used, these days, more in a distributed fashion over web/networks. What additional questions/issues, do you think, need to be addressed to test a web application using distributed database systems?

6. Srividya Ramaswamy

7. Bryan Robbins: The authors argue that DB testing requires special consideration because "the input and output spaces include the database states as well as the explicit input and output parameters of the application." This seems to indicate that side-effects on the DB state are more important in DB testing, as opposed to other forms of event-driven testing (such as GUI testing) where only expected output is important. If we did consider application state information as an "output" in GUI testing, what are the potential tradeoffs?

8. Praveen Vaddadi: What potential problems do you see in performing regression tests for database applications? Various techniques for selecting subset of tests for re-execution have been proposed but they assume that the only factor that can change the result of a test case is the set of input values given for it, while all other influences on the behavior of the program will be constant for each re-execution of the tests. Clearly, this assumption would be impractical for testing database applications since in addition to the program state, the database state also needs to be taken into consideration. What other issues do you think will be problem-some in applying the naive test selection techniques for regression tests of database applications?

9. Subodh Bakshi

10. Samuel Huang: One major issue seems to be how decide the database's _state_, i.e. the set of tuples that populate a given schema. Strategies proposing synthetic data for this task seem to trade-off between producing a state that is _common_ in practice vs a state that helps to uncover faults - these two criterion oftentimes may not coincide. What types of strategies exist that focus on one criterion vs the other? Perhaps the realm of probabilistic databases may facilitate in generating ``common'' states, and also support correlations between tuples, etc.

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening: The tool includes four components and I think there is some overlap about the third and fourth component. For example, if we use "select", the output should be the selected items and state of the tables should keep the same, at this time, the third and fourth component are doing different things. But if we use "insert", the component 3 should check if it is inserted successfully, and the component 4 should check the state after inserted. At this time, component 3 and 4 are doing the similar things, and in fact, it is enough to just use component 4. It seems that the state can represent the output to some degree. So do you think that component 3 and 4 can be combined? If yes, do you have any sense to design new component 3 and 4 more efficiently?

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Automated Usability Testing/Evaluation

[Nov. 17] Student presentation by Ran Liu. (sources: (1) Automated Usability Testing Using HUI Analyzer, Simon Baker et al. 19^th Australian Conference on Software Engineering, 2008. (2) Automated Usability Testing Framework, Fiora T. W. Au et al. Proc. 9^th Australasian User Interface Conference (AUIC2008), Wollongong, Australia, 2008.). Slides: Ran.pdf

Questions posed by students:

1. Nathaniel Crowell: In describing the requirement for "form layout" the authors cite the claim that "the ideal sequence that is most commonly accepted is top left to bottom right". Wouldn't this actually vary based on the intended set of users? For instance, for users whose native language is written right to left, I'd expect that some right to left format would be the appropriate sequence. Should usability requirements instead be formed in the context of the intended users rather than trying to form some general set of guides that should apply in all situations?

2. Amanda Crowell: The authors of this paper stated that 6-8 test users are usually sufficient to identify most of the major usability issues. Our previous usability testing paper stated that 5 was the "magic number". Based on your reading and understanding of usability testing, do you have any idea what kind of factors are leading to the difference in number of users?

3. Christopher Hayden: The authors of the paper cite the phenomenon of psychological pollution, where users act differently due to the knowledge that they are being monitored, as a major threat to validity in usability testing. It seems that this effect may be impossible to eliminate in any formal testing environment. Are you aware of any research, or do you have any ideas, to address this threat in a way that still yields feedback of sufficiently high quality?

4. Rajiv Jain: It seems there are certain conventions when it comes to the aesthetics and layout of a GUI. Do you think machine learning techniques could be used to automatically detect metrics for these conventions and score a GUI based on its deviations from what is “normally seen”? Could this automate usability testing further?

5. Shivsubramani Krishnamoorthy

6. Srividya Ramaswamy

7. Bryan Robbins: As we discussed in class, although look-and-feel play a part, the user experience is often dominated by cognitive factors. Unfortunately, these are inherently hard to automate, but could fit in to a hybrid approach (like that of the HUI) which combines some user data with expectations. So what are some metrics that could represent cognitive aspects of the user experience such as accuracy and response time? How could we encode the new expectations into expected action sequences?

8. Praveen Vaddadi: I agree that it is impossible to fully automate the usability study. However, the user behavior can be mimicked by a ruled-based program that considers temporal aspects (as discussed in the class) as well. At this point, we need to ask whether are there reliable rules for user behavior on the web and if we really want to remove the users from the user test. What aspects do you think can never be automated in a usability study?

9. Subodh Bakshi

10. Samuel Huang: The HUI Analyzer's matching system looks similar to string matching algorithms. If not done yet, could we leverage that body of work to give more interesting types of alignments, such as partial matches, where ``partial'' here means approximate with some criterion? Would these partial matches be useful in the context HUI is working over?

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening: I understand that in the HUIA testing, the recorder will keep track of what the user has done and record them in an AAS, and then compare AAS with EAS. But how can we get EAS? In another way, how can we know what exactly the user want to do?

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Testing Mobile Applications

1. [Nov. 19] Student presentation by Ghanashyam Prabhakar. (sources: Hermes: A Tool for Testing Mobile Device Applications, 2009 Australian Software Engineering Conference, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05076634). Slides: Ghanashyam.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell: How is requirement #3 accomplished (Consume minimal resources on mobile devices)?

2. Amanda Crowell: For the 6 requirements for Hermes set out in the paper, are there any that are more important than others?? In other words, if I had to choose between two different testing tools, neither of which met all the requirements, how would I know which one to pick?

Answer Outline: The authors, on the outset set quite high requirements, all of which I believe are important. But if I were to choose one requirement amongst the six they set, I feel the first one, i.e. "Automatically deploy applications, execute tests and generate reports" would be the more important. Because this requirement though very generically stated encompasses the basic requirements for automated mobile application testing. Also a tool that can address the problem of heterogeneity in mobile platforms would be considered better.

3. Christopher Hayden: The authors say something strange in passing in the abstract of the paper. Specifically, they state that Hermes is currently more expensive to employ than manual testing. This seems odd considering two facts. First, in the experiment presented, Hermes is superior to manual inspection in the verification of aesthetic characteristics of a GUI. This suggests that Hermes proficiency should increase relative to manual testing when less obvious flaws are considered. Second, tests written in Hermes at some up front cost can be executed and inspected quickly from that point forward, whereas tests inspected manually are always expensive. Do you feel that the authors' conclusion that Hermes is more expensive than manual testing is justified given the relatively simple experiment performed?

Answer Outline: From the authors' statement in the abstract, the keyword I would pick is "currently". Employing Hermes is definitely more expensive currently due to the fact that tests have to be manually written since the GUI Visual Modeler has not been designed/implemented as yet. So the upfront cost you talk about is what is being taken into account when the authors state that Hermes is currently more expensive. Of course having said that their relatively simple experiment could've been made more elaborate and they intend to do that in their future work.

4. Rajiv Jain: What are the biggest difference between testing a Mobile device and a normal PC? Are the lines getting blurrier with Mobile Operating Systems?

Answer Outline: With regards to mobile devices the main differences as compared to a PC would be the environment and the heterogeneity in mobile device platforms . So with regards to testing applications on mobile devices - the environment and testing across the heterogeneous mobile platforms would be priority.

5. Shivsubramani Krishnamoorthy

6. Srividya Ramaswamy

7. Bryan Robbins: On the PC subsystem side of the Hermes architecture, do you think it would be feasible to use existing tools for test scripting and execution? In other words, once we remove them from the heterogeneous underlying devices, do we expect mobile applications to be all that different from one another?

Answer Outline: You make a very good point, yes, they can very easily use existing tools for test scripting on the PC subsystem side. However for test execution the issue of device heterogeneity does come into play and this can be handled in the way the mobile subsystem is designed for each device.

8. Praveen Vaddadi: The paper discourages the use of device emulators for testing. However, if we are testing with real devices, their limited processing power and storage does not allow on-board diagnostic software to be loaded, so they lack instrumentation. With real devices, we will not be able to record the protocols going back and forth between the device and the application, and this will limit the ability to isolate problems and make corrections. Do you see this as a potential disadvantage?

Answer Outline: While black box testing of mobile applications, the question is do we really want to know what's going on behind the scene. I feel it's not much of a disadvantage because we are not really bothered to know what parameters are being passed or what protocol is being called to actually run that piece of application. When it comes to actually debugging the software on the mobile device to try and figure out what went wrong, well I guess it does handicap the developer a little. Probably the reason why the authors suggest future work to include developing probes allowing device data to be extracted that otherwise cannot be accessed using the J2ME API.

9. Subodh Bakshi

10. Samuel Huang: Configurations seems to play a large role in the challenges of programs passing (at least) the ``aesthetic'' type of tests. However, the ability to perform automated test validation on a set of tests of this type (even if we restrict ourselves to a narrower set than ``all' aesthetic tests) seems appealing, almost as if we are working towards an objective function to be used in optimization. Given that the GUIs are often described in XML, could we push this further and develop a search-like algorithm that considers different combinations of configuration parameters and arrives at arguably ``good'' results (namely a nice looking layout)?

Answer Outline: Well most definitely you could, but the problem here I see would be obtaining the various combinations of `configurations'. For example an application which passed all tests on a Nokia N97 might fail miserably on a newer version of Nokia N900 maybe due to the fact that Nokia N900 uses an advanced Linux based operating system called Maemo OR maybe even because of the way the application is rendered on the touch screen the way it is, so the point is, its quite difficult to model the various combinations. So I feel its best to actually run the test on the test device rather than trying to model its behavior. But IF someone were to actually try and build an emulator the solution you have presented definitely seems to be an optimal one.

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening: (1) I think Hermes is the automation of the manual testing for mobile device application. So if we use the same test cases, I guess the coverage will not change, but why the result of Hermes is much better than manual testing? Where are the improvements from? (2) Hermes took 25 min for the testers to specify the test cases, much longer than manual testing. So I think there must be some way to make it easier, for example, through using GUI to generate the xml for test cases automatically. So do you have any sense for how to do this?

Answer Outline: (1) I guess the measure of comparison is time taken for testing across various mobile devices, in that sense they say Hermes is better than manual testing. (2) Yes, I feel too that the GUI Visual modeler can be modeled based on some existing tools, for example like GUITAR.

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Test Prioritization

1. [Nov. 24] Student presentation by Hu, Yuening. (sources: Clustering test cases to achieve effective and scalable prioritization incorporating expert knowledge, Shin Yoo; Mark Harman; Paolo Tonella; Angelo Susi, Proceedings of the eighteenth international symposium on Software testing and analysis, 2009, Testing #2, Pages 201-212, DOI=http://portal.acm.org/citation.cfm?id=1572272.1572296&coll=GUIDE&dl=GUIDE&CFID=61019873&CFTOKEN=44062317). Slides: Yuening.pdf

Questions posed by students:

1. Nathaniel Crowell: How is the Analytic Hierarchy Process used in the clustering framework?

2. Amanda Crowell: Briefly describe the steps of the clustering framework presented in the paper (Clustering, Intra prioritization, inter prioritization, generate best order, evaluation).

3. Christopher Hayden: In the paper, the clustering is based on a code coverage metric (tests covering similar code are grouped together) and the intra-cluster ordering is also based on a code coverage metric. The only human input is the inter-cluster ordering. It seems that the real effect of arranging the tests in this way is to maximize the distance of consecutive tests, thereby increasing the amount of tested code quickly, and that the effect of ordering the clusters (i.e., human input) should be minimal. Suppose we greedily order the tests in such a way that we always execute the test that covers the most code that has not been covered by a previously executed test. Would you expect this arrangement of tests to perform similarly to the clustering technique?

4. Rajiv Jain:

5. Shivsubramani Krishnamoorthy

6. Srividya Ramaswamy

7. Bryan Robbins:

8. Praveen Vaddadi: The combination of AHP and ICP has been empirically evaluated and the results showed that this combination of techniques can be much more effective than coverage-based prioritisation. One surprising finding was that an error rate higher than 50%, i.e. human tester making wrong comparison half the time, did not prevent this technique from achieving higher APFD than coverage based prioritisation. How do you explain this unexpected finding? Are there more factors in addition to the effect of clustering influencing this result?

9. Subodh Bakshi

10. Samuel Huang: The authors state that the ideal similarity criterion that could be used for clustering is a ``similarity between faults detected by test cases, however this information is inherently unavailable before the testing is finished.'' They instead rely upon a Hamming distance measure between statement coverage of pairs of test cases. Is this a good measure? What other measure/information could we incorporate into our model to potentially improve the clustering?

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Regression Testing III

[Nov. 24] Student presentation by Christopher Hayden. (sources: Checking inside the black box: regression testing by comparing value spectra, Xie, T.; Notkin, D., Software Engineering, IEEE Transactions on, Volume: 31, Issue: 10 Oct. 2005, Page(s): 869- 883; Harrold, M. J., Rothermel, G., Wu, R., and Yi, L. 1998. An empirical investigation of program spectra. In Proceedings of the 1998 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis For Software Tools and Engineering (Montreal, Quebec, Canada, June 16 - 16, 1998). A. M. Berman, Ed. PASTE '98. ACM, New York, NY, 83-90. DOI= http://doi.acm.org/10.1145/277631.277647; http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29). Slides: Christopher.pdf

Questions posed by students: THIS SPEAKER IS NOT ACCEPTING NEW QUESTIONS

1. Nathaniel Crowell

2. Amanda Crowell:

3. Christopher Hayden

4. Rajiv Jain:

5. Shivsubramani Krishnamoorthy

6. Srividya Ramaswamy

7. Bryan Robbins:

8. Praveen Vaddadi: With regard to value spectra and function execution comparison, how does the comparison of values of pointer-type variables work?

Answer Outline: I will use the terms references instead of pointers. In value spectra analysis, the values of the references themselves are not checked. It would make no sense to do so as reference values vary due to memory allocation and are unlikely to be the same across multiple executions. However, the values of the referents are compared. Thus, when analyzing function entry/exit states, two states are equivalent if and only if all referents in the scope described in the paper are equivalent. In the case of dynamic data structures, in which one referent contains a reference to another referent, the analysis recurses to a depth determined given by a runtime parameter. Two states are equivalent if and only if all referents encountered in the recursion are equivalent.

9. Subodh Bakshi

10. Samuel Huang: The sensitivity of value spectra to code refactoring seems somewhat problematic. In particular, I would think that one wouldn't want the (arguably common) refactoring of (for example) one function into several smaller, more modular functions, should be considered a big change in the program flow. However, it seems like value spectra may consider it as such in terms of a distribution adjustment. Is there a way to post-process or otherwise further summarize existing value spectra that would make them more agnostic as to how specifically the program flow occurs, but still detects flaws in the general flow itself?

Answer Outline: Value spectra analysis closely resembles forms of dynamic data flow analysis. A lot of research has been done on dynamic data flow analysis and many useful heuristics have been developed. While there has been no research (as far as I can tell) on applying such heuristics to value spectra analysis, I suspect that they, or something similar to them, would be useful in improving the detection capability of value spectra by reducing the false positive rate. In general, questions arising in data flow analyses are undecidable. I would not be surprised if determining which changes within a value spectrum are relevant is undecidable in the general case. All of that said, there are of course more practical things that can be done, such as selectively instrumenting code to include only those functions that wrap a call to the modified function or to exclude the modified functions. However, selective instrumentation is not applicable in the general case.

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Software Compatibility Testing

[Dec. 1] Student presentation by Tandeep Sidhu. (sources: “Effective and Scalable Software Compatibility Testing” by Il-Chul Yoon, Alan Sussman. Atif M. Memon, and Adam Porter, in ISSTA '08: Proceedings of the International Symposium on Software Testing and Analysis, 2008.). Slides: Tandeep.pdf

Questions posed by students:

1. Nathaniel Crowell: What are the differences between exhaustive-cover configurations and the direct dependency-cover configuration described in the paper?

2. Amanda Crowell: Briefly describe the components of the Annotated Component Dependency Model (ACDM), the Component Dependency Graph (CDG) and Annotations. What relationships does an ACDM show?

3. Christopher Hayden: Focusing on direct dependencies seems like a natural approach. In a configuration where A depends on B and B depends on C, it should not matter to A what version C has provided B builds and executes correctly. Given that, if I'm a developer for project A, what interest do I have in looking at any dependencies at a distance greater than 1 from A in the dependency graph? Is there an example of a case in which a dependency at distance 2 from the SUT causes the SUT to fail, but does not cause the dependency at distance 1 to fail?

4. Rajiv Jain:

5. Shivsubramani Krishnamoorthy: There are a few commercial compatibility testing tools which I learnt about from the internet that claim to have lot many features supporting wide range of hardware an software. For example, Swatlab (http://www.swatlab.com/qa_testing/compatibility.html). How do you compare them with Rachet?

6. Srividya Ramaswamy

7. Bryan Robbins:

8. Praveen Vaddadi: How do the developers build the initial configuration space for the SUT? What about the components that indirectly influence the software under test?

9. Subodh Bakshi

10. Samuel Huang: It would be satisfying if we only needed to inspect the configurations of direct dependencies, as the paper describes. However, it might informative to, in a somewhat crude way, investigate whether or not the SUT (namely the root node of the CDG) implicitly uses information that is not supposed to be exposed by the dependency APIs (perhaps we assume that a function behaves in a certain way, or uses a certain data structure as a backend, when we are not explicitly given this in the utilized API). One way this may be violated is to investigate the effects of using variable configurations for the direct dependencies of the dependency itself, namely the second-level dependencies from the SUT. Could we detect such a violation of an API by observing the compilation (or perhaps some code execution?) of the SUT under such secondary dependency considerations?

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

Mutation Testing

[Dec. 1] Student presentation by Srividya Ramaswamy. (sources: Is Mutation an Appropriate Tool for Testing Experiments? J.H.Andrews, L.C.Briand, Y.Labiche). Slides: Srividya.pdf

Questions posed by students:

1. Nathaniel Crowell: What were the four mutant operators used by the researchers? Can you think of any additional operators that would be useful?

2. Amanda Crowell: The researchers for this paper sought to determine if mutation testing is an appropriate method for testing experiments. But what are the problems associated with finding test programs that mutation testing is meant to handle?

3. Christopher Hayden: In the paper, the authors filter out mutants that produce no detectable change from the oracle. Later they argue, based on results from the Space program, that mutants seem to introduce that are detected at a rate similar to that of real faults and that faults introduced by hand are harder to detect than real faults. Are these results valid in light of the filtering? What use are they in actual software development where we don't have a pristine program to compare to? We saw in an earlier presentation that faults do not always propagate to output. Isn't is possible that mutants that did not yield detectable changes from the oracle more closely resemble the faults introduced by hand and by filtering them the authors bias their data toward easily detectable mutations?

4. Rajiv Jain:

5. Shivsubramani Krishnamoorthy: I understand an equivalent mutant is the one which is syntactically different from the original program but behaviorally same. Thus, detecting and killing such mutants is very difficult. Do you agree that the process of filtering out equivalent mutant is an overhead, or is the extra step worth taking?

6. Srividya Ramaswamy

7. Bryan Robbins:

8. Praveen Vaddadi: I understand from the paper that while there is probably some minimum number and type of mutation operators that are required to achieve a benefit from mutation analysis, no existing mutation operator set is exhaustive. Therefore, if the technique is to be used by software testers, then mutation speed and efficiency should be chosen over the number or type of mutation operators. Is this conclusion valid?

9. Subodh Bakshi

10. Samuel Huang: Before analyzing the results of an experiment using mutation testing, the decision of mutation operators seems of prime importance. Given that there are existing tools which have a large ``grab bag'' of different classes of mutation operators, could we somehow correlate these operators with the ability to reproduce classes of *faults*? Formally this may take the form of clustering sets of mutation operators with the knowledge of their fault-type causing tendencies. The relational clustering literature, in particular applied to NLP type problems, does work on this, which is sometimes referred to as biclustering or co-clustering. Such a clustering may allow us to better select _candidate_ mutation operators to satisfy some notion of fault-type coverage, or perhaps to better approximate real-life faults that are encountered.

11. Ghanashyam Prabhakar

12. Jonathan Turpie

13. Cho, Hyoungtae:

14. Hu, Yuening

15. Liu, Ran:

16. Ng Zeng, Sonia Sauwua

17. Sidhu, Tandeep

18. Guerra Gomez, John Alexis:

[Dec. 3] Discussion of EasyMock http://easymock.org/ by Nathaniel and Amanda Crowell.
Regression Testing IV

[Dec. 8] Automatically Repairing Event Sequence-Based GUI Test Suites for Regression Testing by Atif M. Memon. ACM Trans. on Softw. Eng. and Method., 2008, ACM Press.

Test Oracles

Contents: Modeling oracles

[Dec. 8] General discussion and overview. Slides: 9.pdf
[Dec. 8] Qing Xie and Atif M. Memon, Designing and comparing automated test oracles for GUI-based software applications, ACM Transactions on Software Engineering and Methodology, vol. 16, no. 1, 2007.

Test Coverage

[Dec. 10] “GUI Interaction Testing: Incorporating Event Context” by Xun Yuan, Myra B. Cohen, and Atif M. Memon, IEEE Transactions on Software Engineering, vol. xx, no. x, 2010, IEEE Computer Society.

Student Presentations: All students are strongly encouraged to present a topic related to software testing. You must prepare slides and select a date for your presentation. Group presentations are encouraged if the selected topic is broad enough. Topics include, but are not limited to:

o Combining and Comparing Testing Techniques

o Defect and Failure Estimation and Analysis

o Testing Embedded Software

o Fault Injection

o Load Testing

o Testing for Security

o Software Architectures and Testing

o Test Case Prioritization

o Testing Concurrent Programs

o Testing Database Applications

o Testing Distributed Systems

o Testing Evolving Software

o Testing Interactive Systems

o Testing Object-Oriented Software

o Testing Spreadsheets

o Testing vs. Other Quality Assurance Techniques

o Usability Testing

o Web Testing

Course Project (System, Unit, Integration, and Continuous Testing of four GUITAR Apps)

[NOTE: This is the standard project for this class. If you don’t want to do this project, you are welcome to propose your own. I have some project ideas too; feel free to stop by and discuss. We will have to agree on a project together with a grading policy.]

Summary: GUITAR (http://guitar.sourceforge.net/) is a system consisting of numerous applications. In this class project, we will test four of GUITAR’s applications (JFCGUIRipper, GUIStructure2EFG, TestCaseGenerator, JFCGUIReplayer) using conventional testing techniques. We will also setup a continuous testing process for these applications using the Hudson (http://hudson.dev.java.net/) software. The black-box and white-box test cases, which you will design, for this process will consist of (1) large system tests that cover most of the functions of an application, (2) small system tests, each targeting a specific part of the application; all these tests together cover most of the functions of the application, and (3) Unit tests.

The project will consist of four phases. Phases 1 and 4 are to be done individually; phases 2 and 3 are to be done in teams, each consisting of 3-4 students.

Team

Members

Phase 1

Demo

Date/Time

Phase 2

Demo

Date/Time

One hour each

Phase 3

Demo

Date/Time

One hour each

Phase 4

JFCGUIRipper

Nathaniel Crowell

Amanda Crowell

Christopher Hayden

Rajiv Jain

(09/10/09)/11:30AM

(09/10/09)/11:00AM

(09/11/09)/1:00PM

(09/10/09)/1:00PM

Thu. 10/15: 2:00pm

Rajiv Jain

GUITARModel

JFCModel

GUIRipper

JFCRipper

GUIStructure2EFG

Shivsubramani Krishnamoorthy

Srividya Ramaswamy

Bryan Robbins

Praveen Vaddadi

(09/10/09)/12:00PM

(09/10/09)/2:30PM

(09/10/09)/12:00PM

(09/10/09)/12:30PM

Thu. 10/15: 3:00pm

Praveen Vaddadi

GUIStructure2Graph-Plugins/GUIStructure2EFG

(GUIReplayer + JFCReplayer)

(GUIStructure2GraphConvert + GUIStructure2Graph-Plugins/GUIStructure2EFG)

GUIStructure2GraphConvert

TestCaseGenerator

Subodh Bakshi

Samuel Huang

Ghanashyam Prabhakar

Jonathan Turpie

(09/11/09)/1:30PM

(09/11/09)/3:00PM

(09/11/09)/3:30PM

(09/11/09)/10:00AM

Wed. 10/14: 3:00pm

Jonathan Turpie

(TestCaseGenerator + RandomTestCase + SequenceLengthCoverage)

(GUIRipper+JFCRipper)

TestCaseGenerator-Plugins/RandomTestCase

JFCGUIReplayer

Cho, Hyoungtae

Hu, Yuening

Liu, Ran

Ng Zeng, Sonia Sauwua

Sidhu, Tandeep

(09/11/09)/10:00AM

(09/11/09)/12:00PM

(09/11/09)/12:30PM

(09/10/09)/2:30PM

(09/11/09)/9:30AM

Wed. 10/14: 9:00am

Ran Liu

(GUITARModel+JFCModel)

TestCaseGenerator

GUIReplayer

JFCReplayer

TestCaseGenerator-Plugins/SequenceLengthCoverage

Independent

Guerra Gomez, John Alexis

(09/15/09)/11:00AM

-NA-

Phase 1

Goal: Downloading and running all the applications.

Procedure: In this phase, you need to demonstrate that you are able to run all four applications (please see your course instructor for the inputs to these applications). The source code for all these applications resides in a Subversion repository that can be viewed at http://guitar.svn.sourceforge.net/viewvc/guitar/. Please refer to https://sourceforge.net/projects/guitar/develop for additional help with Subversion. Each application (lets call it foo for fun) resides in the folder foo.tool/ in the repository. Checkout this folder, for each application, and read its README.txt file. You will find more information about the modules that foo uses for building and execution. You will also find instructions on how to build and run foo.

Deliverables: There are no deliverables for this phase. Points will be awarded for demos.

Grading session: During the grading session, you will have access to a new machine (Windows or Linux) that is connected to the Internet. You will have to download, build, and execute all four applications and run them on all the inputs. You will also need to install Ant, Subversion, Java, and any other tools needed to run the applications. Each application is worth 10 points.

Points: 40

Due on or before Sep. 14 by 2:00pm. [Late submission policy – you lose 20% (of the maximum points) per day]

Phase 2

Goal: Setting up the continuous building and regression testing process using large system tests.

Procedure: This phase consists of working with Hudson to automate what you did in Phase 1. In addition, you will evaluate the coverage of the inputs, document them as test cases, and improve the inputs if needed. Form teams of 3-4 students and take ownership of one of the applications.

You will need to write an Ant script that automatically builds your application. Assume that the Ant script will be run on a new Windows or Linux machine that has only Ant and Java installed on it. Submit the Ant script for addition to the foo.tool/ folder in our Subversion repository. Configure Hudson to run this Ant script periodically. Then use the input files from Phase 1 to develop large system tests that will be stored in the foo.tool/tests/large-system-tests/ folder in the repository. Configure Hudson to run these tests periodically and show the results of tests passing/failing. Configure Hudson to generate Emma code coverage reports for these tests and make them available for viewing over the web. Ensure that you get 100% statement coverage from these tests. Add more large system tests to increase coverage if needed. Or provide evidence that certain parts of the code cannot be executed using any system tests. Finally, develop test oracles for reference testing against the application’s previous build. Finally, document the test cases using the wiki tool – describe the inputs and expected outputs.

Deliverables: Your instructor will view the output of your tests, coverage reports, etc. online on Hudson. The documentation should be submitted via the wiki pages. You are advised to meet your instructor regularly (e.g., twice a week) to show progress for each milestone (e.g., Team created, application selected, Ant script done, Hudson configured to run Ant script, first system test automated, wiki page started, etc.).

Grading session: During the grading session, you will need to show everything up and running on our Hudson server (30 points), including the test and coverage reports. Your instructor will seed up to 10 artificial faults, one at a time, in the application under test. For each individual seeded fault, your test cases will be re-run via Hudson to see if they detect the fault. 10 points will be awarded for each fault detected by your test cases, for a maximum of 50 points. The wiki documentation of the test cases is worth 30 points.

Points: 110

Due on or before Oct. 5 by 2:00pm. [Late submission policy – the entire team will lose 20% (of the maximum points) per day]

Phase 3

Goal: Designing small, targeted system tests and integrating them into the continuous testing process.

Procedure: This phase requires the development of new carefully crafted inputs for which output can be predicted and created by hand. These new inputs will be very different from the ones used in Phases 1 and 2 – you will see that the output from the previous inputs was 100’s of MB, which was impossible to verify manually. The new inputs developed in this phase will be tiny Java swing programs that exercise select parts of your application. For each test, you will manually develop test oracles that check whether the application performed as expected.

Submit these tests so that they can be stored in the foo.tool/tests/small-system-tests/ folder in the repository. Configure Hudson to run these tests periodically and show the results of tests passing/failing. Configure Hudson to generate Emma code coverage reports for these tests and make them available for viewing over the web. Ensure that you get 100% statement coverage from these tests. Add more small system tests to increase coverage if needed. Or provide evidence that certain parts of the code cannot be executed using any system tests. Finally, document the test cases using the GUITAR wiki – describe the inputs and expected outputs (this document is much more detailed than the one developed in Phase 2).

Grading session: During the grading session, you will need to show everything up and running on our Hudson server (30 points), including the test and coverage reports. Your instructor will then seed up to 15 artificial faults, one at a time, in the application under test. For each individual seeded fault, your test cases will be re-run via Hudson to see if they detect the fault. 10 points will be awarded for each fault detected by your test cases, for a maximum of 100 points. The wiki documentation of the test cases is worth 70 points.

Points: 200

Due on or before Nov. 9 by 2:00pm. [Late submission policy – the entire team will lose 20% (of the maximum points) per day]

Phase 4

Goal: Designing and deploying unit tests.

Procedure: So far, all your tests have been system tests. In this phase, you will develop unit tests (e.g., using JUnit) for the modules used by the tools. Remember that this phase is done individually, so select a module quickly. For each test, you will develop inputs as well as test oracles (using assert statements in the test case) that check whether the unit performed as expected on the inputs.

Submit these tests so that they can be stored in the tests/ folder of the relevant module in the repository. Configure Hudson to run these tests periodically. Configure Hudson to generate Emma code coverage reports for these tests and make them available for viewing over the web. Ensure that you get 100% statement AND BRANCH coverage (use Cobertura http://cobertura.sourceforge.net/) from these tests. Or provide evidence that certain parts of the code cannot be executed using any unit tests. Finally, embed Javadoc comments in the JUnit tests so that Javadoc can be generated automatically for them.

Deliverables: Your instructor will view the output of your tests, coverage reports, etc. online on Hudson. The documentation should be submitted via the Javadoc pages. You are advised to meet your instructor regularly (e.g., twice a week) to show progress for each JUnit test that you develop.

Grading session: During the grading session, you will need to show everything up and running on our Hudson server (30 points), including the test and coverage reports. Your instructor will then seed up to 10 artificial faults, one at a time, in the module under test. For each individual seeded fault, your test cases will be re-run via Hudson to see if they detect the fault. 10 points will be awarded for each fault detected by your test cases, for a maximum of 50 points. The Javadoc documentation of the test cases is worth 70 points.

Points: 150

Due on or before Dec. 1 by 2:00pm. [Late submission policy – you lose 20% (of the maximum points) per day]