Evaluation Resources
This page describes how we evaluated the reading techniques. Anyone
wishing to evaluate them in a different environment may still find
some or all of our resources useful.
We ran our study as part of a software engineering course at the University
of Maryland. Our class of upper-level undergraduates and graduate students
was divided into 15 two- and three-person teams. Teams were formed randomly
and then examined to ensure that each met certain minimum requirements for
the class (e.g. no more than one person on the team with low C++ experience).
Each team was asked to develop an application during the course of the
semester, going through all stages of the software lifecycle (translating
customer requirements into object and dynamic models, then implementing the
system based on these models). The application
to be developed was one that would allow a user to edit OMT-notation
diagrams [Rumbaugh91]. That is, the user had to be able to graphically
represent the classes of a system and the different types of relations
among them, perform some operations (e.g. moving, resizing) directly on
these representations, and enter descriptive attributes (class operations,
object names, multiplicity of relations, etc.) that would be displayed
according to the notational standards. The
project was to be built on top of the ET++ framework
[Weinand89].
Since the analysis was carried out both for individuals and the teams
of which they were part, we were able to treat the study as an embedded
case study [Yin94]. Over the course of the semester, we used a number of
different methods to collect a wide variety of data, each of which we discuss
briefly below. Most of our collection methods are mentioned by Singer and
Lethbridge in their discussion of the pros and cons of various methods
for studying maintenance activities [Singer96], and we respond to some
of their comments where appropriate. We hope this study provides an additional
illustration of their conclusion that, to obtain an accurate picture of the
work involved, a variety of methods must be used at different points of the
development cycle to balance out the advantages and disadvantages of each.
- Questionnaires were used at the beginning (to report previous programming
experience) and end (to report effort spent during the last week of implementation
and the level of completion for each functional requirement for the project)
of the semester. Although the information reported on the beginning questionnaires
could not be verified, the end questionnaires were verified against the
executables submitted by each team. The unit of analysis for the beginning
questionnaires was the individual student, while each end questionnaire was
filled out by the team as a whole. Both were mandatory although self-reported,
and did not affect the students' grades.
- Exam grades were recorded for certain midterm questions that we used to
gauge the students' level of understanding of framework concepts. These
grades were recorded for individual students, were assigned by us after
evaluating each response for the level of understanding exhibited, and
were mandatory (as they constituted part of
the students' grades for the course).
- Progress reports were to be submitted by each team for each week of
the implementation phase. They consisted of an estimate of the number of
hours the team had spent implementing the project that week, and a list of
which functional requirements had been begun and which had been completed.
As the students were told that the progress reports had no bearing on their
grades, many teams opted to submit them only sporadically or not at all.
(In some ways, these reports were similar to Singer and Lethbridge's idea
of logbooks, which allow the developer to record information at certain
times throughout the development process. Singer and Lethbridge concentrate
on the dangers of making the report too time-consuming, but we noticed quite
the opposite phenomenon: if the experimenter makes the report too minimal,
the developer may assume that the information being collected cannot be
truly important and thus give completing the report a very low priority.)
- Problem reports were requests for clarification or for help with ET++
that the students submitted (via email) to the course instructors. A record
was kept of the general subject of each request and of the team that had
submitted it. In this way we hoped to maintain a record of the kinds of
difficulties teams encountered during the course of the project. Problem
reports were obviously not mandatory and had no effect on student grades,
but were a resource that the students knew they could use at their
discretion. (Singer and Lethbridge focus on the inaccuracies of
retrospective reports, but our problem reports were actually an excellent
way to get an accurate picture of where teams were having problems at the
time they were having them - which may be, admittedly, unique to the classroom
environment.)
- Implementation scores were assigned by us to each team at the end of
the semester. Projects were graded by assessing how well the submitted
system met each of the original functional requirements (on a 6-point scale
based upon the suggested scale for reporting run-time defects in the NASA
Software Engineering Laboratory [SEL92]: "required functionality missing",
"program stops when functionality invoked", "functionality
cannot be used", "functionality can only partly be used",
"minor or cosmetic deviation", "functionality works well").
The score for each functional requirement was then weighted by the subjective
importance of the requirement (assigned by us) and used to compute an overall
implementation score reflecting the usefulness and reliability of the delivered
system; a sketch of this computation appears after this list.
- Final reports were collected from each team at the end of the semester.
These reports consisted of documentation for the submitted system (object
models and use cases) as well as records of the activities undertaken while
implementing the project (object models of examples that had been studied,
lists of classes that had been examined for some functionalities). Additionally,
in-class presentations were given by each team in which they could present
interesting details of the functionality available in their system, their
experiences and difficulties with the techniques they used, and/or their
general approach to implementation. The completeness of the final reports
counted toward each team's grade, although their conformance to any particular
technique did not.
- Self-assessments were mandatory ratings in which each student was asked
to rate the effectiveness of each member of his or her team (including
him- or herself) as well as the team's performance as a whole. This was
partly to detect whether every team member had done their share of the work,
and partly to encourage students to reflect on what they had done well and
poorly during the implementation. Although it was mandatory that each student
return a self-assessment, the assessments did not count directly toward
students' grades (although in some cases, evidence from the self-assessments
and the interviews led to individual grades being slightly adjusted).
- Interviews were mandatory "debriefing" sessions at the end
of the semester. Each team came as a group to the course instructors to
answer questions about the kinds of activities they had undertaken during
the semester, which of these they found particularly useful or useless, and
which parts of the project were easiest and hardest. The original list of
interview questions is included, although we also asked additional questions
dynamically; that is, we steered the interview in new directions as
unforeseen but interesting themes were raised.
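
To make the implementation score concrete, the following Python sketch shows
one way the weighted computation described above could be carried out. The six
scale labels are taken from the list above; the requirement names, importance
weights, 0-5 numeric mapping, and weighted-average aggregation are illustrative
assumptions, not the exact procedure used in the study.

# Hypothetical sketch of the implementation-score computation.
# Requirement names, weights, the 0-5 mapping, and the weighted-average
# aggregation are assumptions for illustration only.
SCALE = {
    "required functionality missing": 0,
    "program stops when functionality invoked": 1,
    "functionality cannot be used": 2,
    "functionality can only partly be used": 3,
    "minor or cosmetic deviation": 4,
    "functionality works well": 5,
}

def implementation_score(assessments, weights):
    """Combine per-requirement ratings into a single team score.

    assessments: functional requirement -> scale label (a key of SCALE)
    weights:     functional requirement -> subjective importance
    Returns a weighted average on the 0-5 scale.
    """
    total_weight = sum(weights[req] for req in assessments)
    weighted_sum = sum(SCALE[label] * weights[req]
                       for req, label in assessments.items())
    return weighted_sum / total_weight

# Example with made-up requirements and importance weights.
assessments = {
    "draw class boxes": "functionality works well",
    "move and resize diagram elements": "minor or cosmetic deviation",
    "edit relation multiplicities": "functionality can only partly be used",
}
weights = {
    "draw class boxes": 3,
    "move and resize diagram elements": 2,
    "edit relation multiplicities": 1,
}
print(round(implementation_score(assessments, weights), 2))  # -> 4.33
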
Our findings are included in the experiences section
of this lab package.