Evaluation Resources
This page describes how we evaluated the reading techniques. Anyone
wishing to evaluate them in a different environment may still find
some or all of our resources useful.
We ran our study as part of a software engineering course at the University
of Maryland. Our class of upper-level undergraduates and graduate students
was divided into 15 two- and three-person teams. Teams were formed randomly
and then examined to ensure that each met certain minimum requirements for
the class (e.g. no more than one person on the team with low C++ experience).
Each team was asked to develop an application during the course of the
semester, going through all stages of the software lifecycle (translating
customer requirements into object and dynamic models, then implementing the
system based on these models). The application
to be developed was one that would allow a user to edit OMT-notation
diagrams [Rumbaugh91]. That is, the user had to be able to graphically
represent the classes of a system and the different types of relations
among them, perform some operations (e.g. moving, resizing) directly on
these representations, and enter descriptive attributes (class operations,
object names, multiplicity of relations, etc.) that would be displayed
according to the notational standards. The
project was to be built on top of the ET++ framework
[Weinand89].
Since the analysis was carried out both for individuals and the teams
of which they were part, we were able to treat the study as an embedded
case study [Yin94]. Over the course of the semester, we used a number of
different methods to collect a wide variety of data, each of which we discuss
briefly below. Most of our collection methods are mentioned by Singer and
Lethbridge in their discussion of the pros and cons of various methods
for studying maintenance activities [Singer96], and we respond to some
of their comments where appropriate. We hope this study provides an additional
illustration of their conclusion that, to obtain an accurate picture of the
work involved, a variety of methods must be used at different points of the
development cycle to balance out the advantages and disadvantages of each.
- Questionnaires were used at the beginning (to report previous programming
experience) and end (to report effort spent during the last week of implementation
and the level of completion for each functional requirement for the project)
of the semester. Although the information reported on the beginning questionnaires
could not be verified, the end questionnaires were verified against the
executables submitted by each team. The unit of analysis for the beginning
questionnaires was the individual student, while each end questionnaire was
filled out by the team as a whole. Both were mandatory although self-reported,
and did not affect the students' grades.
- Exam grades were recorded for certain midterm questions that we used to
gauge the students' level of understanding of framework concepts. These
grades were recorded for individual students, were assigned by us after
evaluating each response for the level of understanding exhibited, and
were mandatory (as they constituted part of
the students' grades for the course).
- Progress reports were to be submitted by each team for each week of
the implementation phase. They consisted of an estimate of the number of
hours the team had spent implementing the project that week, and a list of
which functional requirements had been begun and which had been completed.
As the students were told that the progress reports had no bearing on their
grades, many teams opted to submit them only sporadically or not at all.
(In some ways, these reports were similar to Singer and Lethbridge's idea
of logbooks, which allow the developer to record information at certain
times throughout the development process. Singer and Lethbridge concentrate
on the dangers of making the report too time-consuming, but we noticed quite
the opposite phenomenon: if the experimenter makes the report too minimal,
the developer may assume that the information being collected cannot be
truly important and thus give completing the report a very low priority.)
- Problem reports were requests for clarification or for help with ET++
that the students submitted (via email) to the course instructors. A record
was kept of the general subject of each request and of the team that had
submitted it. In this way we hoped to maintain a record of the kinds of
difficulties teams encountered during the course of the project. Problem
reports were obviously not mandatory and had no effect on student grades,
but were a resource that the students knew they could use at their
discretion. (Singer and Lethbridge focus on the inaccuracies of
retrospective reports, but our problem reports were actually an excellent
way to get an accurate picture of where teams were having problems at the
time they were having them - which may be, admittedly, unique to the classroom
environment.)
- Implementation scores were assigned by us to each team at the end of
the semester. Projects were graded by assessing how well the submitted
system met each of the original functional requirements (on a 6-point scale
based upon the suggested scale for reporting run-time defects in the NASA
Software Engineering Laboratory [SEL92]: "required functionality missing",
"program stops when functionality invoked", "functionality
cannot be used", "functionality can only partly be used",
"minor or cosmetic deviation", "functionality works well").
The score for each functional requirement was then weighted by the subjective
importance of the requirement (assigned by us) and used to compute an overall
implementation score reflecting the usefulness and reliability of the delivered
system; a sketch of this computation appears after this list.
- Final reports were collected from each team at the end of the semester.
These reports consisted of documentation for the submitted system (object
models and use cases) as well as records of the activities undertaken while
implementing the project (object models of examples that had been studied,
lists of classes that had been examined for some functionalities). Additionally,
in-class presentations were given by each team in which they could present
interesting details of the functionality available in their system, their
experiences and difficulties with the techniques they used, and/or their
general approach to implementation. The completeness of the final reports
counted toward each team's grade, although their conformance to any particular
technique did not.
- Self-assessments were mandatory ratings in which each student was asked
to rate the effectiveness of each member of his or her team (including
him- or herself) as well as the team's performance as a whole. This was
partly to detect whether every team member had done their share of the work,
and partly to encourage students to reflect on what they had done well and
poorly during the implementation. Although it was mandatory that each student
return a self-assessment, the assessments did not count directly toward
students' grades (although in some cases, evidence from the self-assessments
and the interviews led to individual grades being slightly adjusted).
- Interviews were mandatory "debriefing" sessions at the end
of the semester. Each team came as a group to the course instructors to
answer questions about the kinds of activities they had undertaken during
the semester, which of these they found particularly useful or useless, and
which parts of the project were easiest and hardest. The original list of
interview questions is included, although we also asked additional questions
dynamically; that is, we steered the interview in new directions as
unforeseen but interesting themes were raised.
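
To make the implementation score concrete, the following Python sketch shows
one way the weighted computation described above could be carried out. The six
scale labels are taken from the list above; the requirement names, importance
weights, 0-5 numeric mapping, and weighted-average aggregation are illustrative
assumptions, not the exact procedure used in the study.

# Hypothetical sketch of the implementation-score computation.
# Requirement names, weights, the 0-5 mapping, and the weighted-average
# aggregation are assumptions for illustration only.
SCALE = {
    "required functionality missing": 0,
    "program stops when functionality invoked": 1,
    "functionality cannot be used": 2,
    "functionality can only partly be used": 3,
    "minor or cosmetic deviation": 4,
    "functionality works well": 5,
}

def implementation_score(assessments, weights):
    """Combine per-requirement ratings into a single team score.

    assessments: functional requirement -> scale label (a key of SCALE)
    weights:     functional requirement -> subjective importance
    Returns a weighted average on the 0-5 scale.
    """
    total_weight = sum(weights[req] for req in assessments)
    weighted_sum = sum(SCALE[label] * weights[req]
                       for req, label in assessments.items())
    return weighted_sum / total_weight

# Example with made-up requirements and importance weights.
assessments = {
    "draw class boxes": "functionality works well",
    "move and resize diagram elements": "minor or cosmetic deviation",
    "edit relation multiplicities": "functionality can only partly be used",
}
weights = {
    "draw class boxes": 3,
    "move and resize diagram elements": 2,
    "edit relation multiplicities": 1,
}
print(round(implementation_score(assessments, weights), 2))  # -> 4.33
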
Our findings are included in the experiences section
of this lab package.