Projects of the Experimental Software Engineering Group at the University of Maryland

Empirical Modeling:

Optimized Set Reduction


Our modeling approach, Optimized Set Reduction (OSR), is based on both statistical and machine learning principles. Given a historical data set, OSR automatically generates, through a search algorithm, a collection of logical expressions, referred to as patterns, which characterize the trends observable in the data set. Patterns provide interpretable models in which the impact of each predicate can be easily evaluated. For each pattern generated by OSR, a reliability of prediction and a statistical significance are calculated based on the learning set.
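
To make the notion of a pattern concrete, the following minimal sketch represents a pattern as a conjunction of predicates over a historical data set and computes the two quantities mentioned above. It is an illustration under stated assumptions, not the OSR algorithm itself: the search that generates patterns is omitted, and the names (`Pattern`, `reliability`, `significance`), the attribute names, and the toy records are all hypothetical. Reliability is assumed here to be the share of matching records that fall in the predicted class, and significance is assumed to be a one-sided binomial tail probability against the base rate of that class in the learning set.

```python
from dataclasses import dataclass
from math import comb
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Predicate = Tuple[str, Callable[[object], bool]]  # (attribute name, test on its value)


@dataclass
class Pattern:
    """A conjunction of predicates predicting one class (hypothetical structure)."""
    predicates: List[Predicate]
    predicted_class: str

    def matches(self, record: Record) -> bool:
        # A record matches only if every predicate holds for it.
        return all(test(record[attr]) for attr, test in self.predicates)


def _matched_and_hits(pattern: Pattern, data: List[Record], target: str) -> Tuple[int, int]:
    """Return (number of matching records, number of those in the predicted class)."""
    matched = [r for r in data if pattern.matches(r)]
    hits = sum(1 for r in matched if r[target] == pattern.predicted_class)
    return len(matched), hits


def reliability(pattern: Pattern, data: List[Record], target: str) -> float:
    """Assumed definition: share of matching records in the predicted class."""
    n, hits = _matched_and_hits(pattern, data, target)
    return hits / n if n else 0.0


def significance(pattern: Pattern, data: List[Record], target: str) -> float:
    """Assumed definition: probability of at least `hits` successes in `n`
    draws if matching records simply followed the overall base rate."""
    n, hits = _matched_and_hits(pattern, data, target)
    base = sum(1 for r in data if r[target] == pattern.predicted_class) / len(data)
    return sum(comb(n, k) * base**k * (1 - base) ** (n - k) for k in range(hits, n + 1))


# Toy learning set (invented values, for illustration only).
HISTORY: List[Record] = [
    {"complexity": 12, "changes": 5, "risk": "high"},
    {"complexity": 15, "changes": 7, "risk": "high"},
    {"complexity": 11, "changes": 6, "risk": "high"},
    {"complexity": 14, "changes": 1, "risk": "low"},
    {"complexity": 3, "changes": 6, "risk": "low"},
    {"complexity": 4, "changes": 1, "risk": "low"},
    {"complexity": 5, "changes": 2, "risk": "low"},
    {"complexity": 2, "changes": 0, "risk": "low"},
]

# Pattern: complexity > 10 AND changes > 3 => risk = high
PATTERN = Pattern(
    predicates=[("complexity", lambda v: v > 10), ("changes", lambda v: v > 3)],
    predicted_class="high",
)

if __name__ == "__main__":
    print(reliability(PATTERN, HISTORY, "risk"))
    print(significance(PATTERN, HISTORY, "risk"))
```

Because each predicate is stored separately, removing one from `predicates` and recomputing both measures gives a simple way to see how much that predicate contributes to the pattern.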

Validation Strategy

We demonstrate the effectiveness of the approach by applying OSR to different problems, e.g., cost prediction, identification of fault-prone components, and change management.
Different modeling approaches have been compared and evaluated based on data from the Software Engineering Laboratory (SEL).

Project Status

Active - Contact Lionel Briand or Khaled El-Emam for current activities.



Algorithms to generate patterns and to merge similar patterns according to a user-defined degree of similarity have been designed.
A prototype has been built, and a commercial tool has been developed by a software company in France.
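
As a rough illustration of merging patterns under a user-defined degree of similarity, the sketch below groups patterns whose sets of matched records overlap sufficiently. The similarity measure (Jaccard overlap of matched record indices), the greedy grouping, and all names (`jaccard`, `merge_similar`, `COVERAGE`) are assumptions made for this example; the similarity criterion actually used by OSR is not specified on this page.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Overlap of two sets of matched record indices, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def merge_similar(coverage: Dict[str, FrozenSet[int]], threshold: float) -> List[List[str]]:
    """Greedily group patterns: a pattern joins the first group whose
    founding member it overlaps by at least `threshold`, else starts a new group."""
    groups: List[List[str]] = []
    for name, covered in coverage.items():
        for group in groups:
            if jaccard(covered, coverage[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical patterns, each described by the indices of the records it matches.
COVERAGE = {
    "p1": frozenset({0, 1, 2, 3}),
    "p2": frozenset({1, 2, 3, 4}),
    "p3": frozenset({7, 8, 9}),
}

GROUPS = merge_similar(COVERAGE, threshold=0.5)

if __name__ == "__main__":
    print(GROUPS)
```

Raising `threshold` toward 1 keeps more patterns separate, while lowering it merges more aggressively, which is one way a user-defined degree of similarity could be exposed.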


First, we demonstrated the effectiveness of the approach by applying OSR to the problem of cost estimation. The OSR predictions were compared to predictions from two other effort estimation techniques to provide a basis for evaluation. The data set for the study came from the COCOMO database and the fifteen projects used by Kemerer for cost modeling evaluation.
Then, we validated the approach by applying OSR to construct a predictive effort model for changes during the maintenance phase. We used a data set describing several projects in the SEL at the NASA Goddard Space Flight Center (GSFC).
We also validated the OSR approach by comparing it to standard modeling approaches currently used for classification, such as logistic regression and classification trees. The approaches were applied to the early identification of high-risk system components. The data set included 146 components of a 260 KLOC Ada system from the NASA/GSFC Flight Dynamics Division.

Last updated: April 2, 1998 by Filippo Lanubile
