Understanding Software Evolution

Mining software repositories at the source code level can provide a greater understanding of how software evolves. However, many software engineering studies of software evolution rely on metrics that are too coarse-grained, such as project size, file count, or class and function count. Renaming a type or function, or moving code across files, does not affect a program's semantics, yet under these metrics such changes appear significant.

We have developed a tool for quickly comparing the source code of different versions of a C program. The approach is based on partial abstract syntax tree (AST) matching and can track simple changes to global variables, types, and functions. These changes characterize aspects of software evolution that are useful for answering higher-level questions, for instance how types evolve, how functions evolve, and which kinds of changes are more frequent than others.
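To make the idea of matching two syntax trees concrete, here is a minimal sketch in Python. It is not the tool's implementation: the nested-tuple AST format, the function names match and compare_functions, and the sample function are all assumptions made for illustration. The sketch walks two versions of a function in parallel and, when the trees have the same shape, reports which identifiers were renamed.

```python
# Toy AST (hypothetical format): a node is a non-empty tuple (kind, child, ...);
# an identifier is ("id", name); any non-tuple value is a literal leaf.

def match(old, new, fwd, rev):
    """Walk two ASTs in parallel, recording identifier renames.

    fwd maps old names to new names, rev maps new names back to old ones.
    Returns False as soon as the shapes diverge or a rename is inconsistent.
    """
    if not (isinstance(old, tuple) and isinstance(new, tuple)):
        return old == new  # literal leaves must be equal
    if old[0] != new[0] or len(old) != len(new):
        return False  # different node kind or number of children
    if old[0] == "id":
        a, b = old[1], new[1]
        # Keep the name mapping one-to-one in both directions.
        return fwd.setdefault(a, b) == b and rev.setdefault(b, a) == a
    return all(match(o, n, fwd, rev) for o, n in zip(old[1:], new[1:]))


def compare_functions(old_fn, new_fn):
    """Return the renames between two versions, or None if they do not match."""
    fwd, rev = {}, {}
    if not match(old_fn, new_fn, fwd, rev):
        return None
    return {a: b for a, b in fwd.items() if a != b}


# Two versions of a made-up function: only the parameter name differs.
v1 = ("func", ("id", "count"), ("params", ("id", "n")),
      ("return", ("+", ("id", "n"), 1)))
v2 = ("func", ("id", "count"), ("params", ("id", "len")),
      ("return", ("+", ("id", "len"), 1)))
# Same shape as v1, but the literal changed, so the bodies no longer match.
v3 = ("func", ("id", "count"), ("params", ("id", "n")),
      ("return", ("+", ("id", "n"), 2)))

print(compare_functions(v1, v2))  # {'n': 'len'}
print(compare_functions(v1, v3))  # None
```

A pure rename shows up as a small name map, not as a rewritten function, which is the distinction that coarse-grained metrics miss.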

Another useful feature of our tool is that it can provide a "release digest" for explaining changes in a software release: what functions or variables have changed, where the hot spots are, whether or not the changes affect certain components, etc. Typical release notes can be too high-level, informal, and incomplete. At the other extreme, comparing releases by inspecting the "diff" output is often too low-level and does not scale well.
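As a rough illustration of what such a digest might contain, the following Python sketch summarizes two releases. It assumes each release has already been reduced to a map from function name to a (file, fingerprint) pair. That input format, the release_digest function, and the sample file and function names are hypothetical and do not describe the tool's actual output.

```python
from collections import Counter


def release_digest(old, new):
    """Summarize the differences between two releases.

    old and new map a function name to (file, fingerprint), where the
    fingerprint is any value that changes when the function body changes.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(name for name in old.keys() & new.keys()
                     if old[name][1] != new[name][1])

    # Hot spots: count how many added, changed, or removed functions
    # each file accounts for.
    hot = Counter()
    for name in added + changed:
        hot[new[name][0]] += 1
    for name in removed:
        hot[old[name][0]] += 1

    return {"added": added, "removed": removed, "changed": changed,
            "hot_spots": hot.most_common()}


# Made-up releases for illustration only.
old_release = {"parse": ("http.c", "a1"), "log_msg": ("log.c", "b1"),
               "old_auth": ("auth.c", "c1")}
new_release = {"parse": ("http.c", "a2"), "log_msg": ("log.c", "b1"),
               "send": ("http.c", "d1")}

digest = release_digest(old_release, new_release)
print(digest["changed"])    # ['parse']
print(digest["hot_spots"])  # [('http.c', 2), ('auth.c', 1)]
```

Here files stand in for components. A real digest would group changes by whatever unit the project treats as a component.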

Our approach and case studies on Apache, OpenSSH, Vsftpd, BIND, and the Linux kernel are presented in detail in our position paper, "Understanding Source Code Evolution Using Abstract Syntax Tree Matching."

We have begun to extend the tool beyond matching ASTs, to measure evolution metrics such as common coupling or cohesion. We are interested in analyzing more programs, with the hope that the tool can be usefully applied to shed light on a variety of software evolution questions.

Send an email to Iulian Neamtiu if you'd like to get our tool.