CMSC 631, Fall 2004
Program Analysis and Understanding
Projects
Guidelines
Students will be expected to complete a substantial research project
during the semester on a topic related to the class. Projects may be
completed individually or in groups of at most 2. The expectations
for group projects will be twice that of individual projects.
Students will write up the results of their project and give a brief
presentation to the class and other interested parties at the end of
the semester.
Dates
- Everyone in the class needs to tell me what you plan to do. You
must write a one-page project proposal and e-mail it to me by
Thursday, October 21. You may also submit your proposal in
writing rather than by e-mail, in which case you should hand it to me
at the beginning of class that day.
- Project presentations will be held during class after Thanksgiving
break. If necessary, we made need to also schedule some other times.
- The final project write-up and any implementation is due at
midnight on Friday, December 17. You must submit your project
via e-mail to me.
Kinds of Projects
Projects on any subject related to the topics covered in class are
acceptable. If you are taking two classes that require projects, then
with the agreement of the other instructor you may be able to do a
single (larger) combined project for both classes. There are two main
kinds of projects: a survey project and a research project.
The Survey Project
- Pick an area in which you are interested
- Read thoroughly 3-6 papers, and read at least superficially 3-6
other papers. I will help, but you will need to do work to find the
papers.
- Write a survey of what you have learned:
- What are the basic problems?
- What are the basic approaches to solving them?
- What are the main achievements to date?
- What are the open problems?
- Keep the scope narrow enough so you can say something interesting
The Research Project
In general, I recommend that you try to do a research project.
Usually research projects involve some survey work and some
implementation work. There are many kinds of research projects,
including the following:
- Design and implement a program analysis
- Try to formalize some interesting aspect of an existing language
- Invent a new language or new language features, to make program
analysis and understanding of that language more tractable
Doing a good research project is harder than doing a survey, but
ultimately much more rewarding. If you want to do a research project
in an area you are not yet sufficiently familiar with, you should
start out with a brief survey project and turn it into a research
project once you know more about the area. One important caveat:
While your research project may be very open ended, be sure that you
have a well-defined goal for the end of the semester so that you have
something to write-up and present.
The Paper
Your write-up at the end of the semester should be in the form of a
research paper. You should include an abstract and an introduction
summarizing your motivation and your accomplishments. The
intermediate sections should contain full details about what you did.
End with a conclusion putting the project in perspective and
mentioning open problems, and a bibliography of cited papers.
Research papers should also include a related work section. Your
write-up should be 5-15 pages, as necessary.
The paper should be typeset and made available in electronic form.
The Presentation
At the end of the semester, you will give a presentation about your
project. Note that after the presentation you may continue to work on
your project, so it is okay if your project is not quite complete. If
we can manage to fit everyone in, presentations will be 20 minutes
long, plus 5 minutes for questions. I recommend preparing slides for
your talk. Depending on how long it takes you per slide, you may want
anywhere from 8-15 slides; try your talk out ahead of time to see how
long it takes.
Suggestions for Research Projects
(Expect this list to grow in the next couple of weeks. Last update 9/19/04.)
- CQual includes a
flow-sensitive analysis, but it's sometimes difficult for
programmers to use because the alias analysis is somewhat conservative
and can lead to unexpected results. Currently CQual includes restrict
to help improve the situation, but that can be tricky too. There are
two potential projects in improving flow-sensitivity in CQual: One
project would be to modify restrict so that it's flow-sensitive; this
would make it much simpler to use and would eliminate the complexity
of restrict inference. Another project would be to implement
flow-sensitivity in a completely different way by using
sub-naming. The idea here is that each time there is a write
to memory, the lhs of the assignment is treated as a new name that
contains the updated value. Subsequent reads to that particular name
are precise. An alias analysis is used so that subsequent reads
through other aliases of that name get an imprecise (weakly-updated)
value. That's a very quick sketch of the idea, and I can provide more
deatils if you are interested.
- In some unpublished work (see Types for
Lexically-Scoped Access Control, a tech report on my web page), we
developed a new statically-checkable access control system that is
considerably richer than Java's. The system is given formally as a
set of type rules, and the goal of this project would be to actually
implement the system for Java.
- The program analyses we have seen in class, and many of the ones
that have been proposed in the research literature, determine fairly
low-level properties of code: whether an expression is constant or
not, which expressions may alias, or whether a series of functions is
called in the correct sequence. In contrast, programmers are often
very concerned with high-level properties of their code: in the 90's,
there was a lot of excitement about design patterns and how they help
in the design and architecture of programs. I believe that it is
possible to check high-level design patterns using some of the
low-level techniques researchers have been developing. The goal of
this project would be to explore this connection, and to implement a
kind of type system for certain design patterns. There are lots of
possibilities here, so this could easily be more than one project.
- We (myself, Mike Hicks, and Polyvios Pratitakis) are developing a
static system for finding race conditions in C programs that use POSIX
threads. We would like to combine this with a system that performs
dynamic
annotation inference. This project would be to develop such a
system. We already have an Eraser-like tool for POSIX threads in C,
so the goal would be to extend that to infer annotations of a kind we
will specify.
- The label-flow analysis that you read about the polymorphism/CFL
paper is flow-insensitive. One interesting project would be to
develop a flow-sensitive version of the algorithm. This might or
might not be related to the project of improving flow-sensitivity in
CQual.
- We've been experimenting lately with extensions to polymorphism
via CFL, and one of the more awkward problems is simply drawing the
graphs for complicated examples, which we need to do in order to
understand how to extend the closure algorithm appropriately. This
project would be to develop a tool for constructing and drawing
pictures of the constraint graphs generated from the language in the
CFL paper (with some extensions), perhaps using the graphviz
package.
- Dynamic Memory Management. Garbage collectors free objects only
when they can never again be used by the program, based on the notion
of reachability. Garbage collectors add overhead to the program
whenever they run, since they trace the program pointer graph to
determine what is reachable. It is possible that garbage collection
performance could be improved by allowing users to free objects
immediately, without waiting for the garbage collector to collect them.
One way that programmers do this in safe languages like Java is to
write an allocator in the safe language that always hands out objects
of the same type T. The problem with this approach is that, while type
safe, it can lead to dangling pointers, since one part of the program
can free an object while it is still in use. One way to deal with this
problem would be to use dynamic checks. In particular, one could
associate an "age" both with a pointer and the object pointed to.
Whenever an object is freed, the object's age is increased by one.
Whenever a pointer is dereferenced, its age is compared against the
age of the object. If the ages differ, then we have a dangling
pointer.
Your task would be to implement this basic idea in two stages. First,
you could write a library for a special kind of aged pointer. All
allocation, dereferencing, and deallocation of aged pointers would be
via library routines. From this library you could write a custom
allocator, and compare its performance against a custom allocator
without aging. Second, you could implement aged pointers using an
automatic analysis. For example, you could use the CIL framework for
source-to-source transformation of C programs to compile existing C
code to use aged pointers instead. In doing so, you will probably want
to optimize the basic approach, perhaps based on insights gained from
the manual implementation. Some possible optimizations would be to (1)
avoid checks altogether when you can prove an object is linear (i.e.
there are no other aliases to it), (2) have a "lock" on the object that
permits using it check-free until it is unlocked, preventing it from
being freed while it is locked. The latter could also be part of your
manual implementation.
- Combining regions with explicit object deallocation in Cyclone.
Cyclone's memory management uses regions, which are memory areas whose
objects are deallocated all at once. Recent work has shown that regions
can be made more effective by also allowing individual object
deallocation as well. The basic idea would be to extend Cyclone's
support for unique pointers to apply to individual regions, rather than
a single global region. The same thing would work for
reference-counted regions. This approach would require some changes to
the Cyclone type system, but not many, and additions to the runtime
system. After doing this, you would want to implement some
applications and measure the performance benefit of the approach
(recent work published in OOPSLA on this topic has some possible
benchmarks).
- Improving support for reference counting in Cyclone. Reference
counting is a technique that aims to make manual memory management
safe. Every time you make a copy of a pointer, a count associated with
the object pointed to will be increased. Freeing a pointer decreases
the count. When the count is 0, the storage for the object is freed.
Some languages insert reference counting automatically during
compilation, but in many cases, reference counting is performed
manually, by the programmer. One problem with the manual approach is
that the programmer could make a copy of a pointer without incrementing
the count. This would lead to a dangling pointer when the storage is
freed (prematurely).
In Cyclone, we support reference counting, but use strict rules on
reference-counted pointers to ensure that you always increment the
count when you copy a pointer. However, reference-counted pointers can
still be hard to work with for the programmer. In particular: (1)
while you can't forget to increment a count, you can forget to
decrement it. This will lead to a memory leak. (2) since
reference-counted pointers are tracked manually, it can be a hassle to
insert calls to all of the management functions. For this project, you
would aim to solve these problems through changes to the language type
system and implementation. In particular, we can deal with (1) by
performing a "leak analysis" to warn the programmer when a pointer
might be dropped without decrementing its count. We can deal with (2)
by using mechanisms for "borrowed pointers" to allow a programmer to
not manage the counts within a scope, but also not be able to free the
pointer until the scope ends. Cyclone has weak support for borrowed
pointers now, but we could improve it by using more sophisticated
inference techniques. Finally, we could also support Objective-C-style
"autorelease" which is common in MacOS event-based programming.
Other Suggestions for Finding Projects
- Take a look (on the ACM digital library, for example; full access
is available from machines in the umd.edu domain) at the last year or
two's worth of proceedings of PLDI (Programming Langauge Design and
Implmentation) and POPL (Principles of Programming Languages). You
may also want to look at the last couple of years of the Static
Analysis Symposium (published in Lecture Notes in Computer Science by
Springer (http://www.springerlink.com for electronic versions)). You
may also be able to find some of these in the library, or you could
certainly stop by my office.
- Take a look at web pages for the Open Source Quality project at
Berkeley, the Softare
Productivity Tools group at Microsoft, and the Program Analysis,
Transformation, and Visualization group at IBM.
Here is a list of papers about interesting tools or techniques; we
may/will cover some of these in the rest of the class, but clearly we
will not have time to talk about all of them. You may want to read
some to get ideas for projects.
- Boyapati, Liskov, and Shrira, Ownership Types for Object Enacpsulation
- Ernst, Cockrell, Griswold, and Notkin, Dynamically Discovering Likely Program Invariants to Support Program Evolution
- Hallem, Chelf, Xie, and Engler, A System and Language for Building System-Specific Static Analyses
- Sagiv, Reps, and Wilhelm, Parametric Shape Analysis via 3-Valued Logic
- Xi and Pfenning, Dependent Types in Practical Programming
- Ball, Rajamani, Automatically Validating Temporal Safety Properties of Interfaces
- Moller and Schwartzbach, The Pointer Assertion Logic Engine
- Corbett, Dwyer, Hatcliff, Laubach, Pasareanu, Robby, and Zheng, Bandera: Extracting Finite-State Models from Java Source Code
- Condit, Harren, McPeak, Necula, and Weimer, CCured in the Real World
- Strom and Yemini, Typestate: A Programming Language Concept for Enhancing Software Reliability
- Crew, ASTLOG: A Language for Examining Abstract Syntax Trees
- DeLine and Fanhdrich, Enforcing High-Level Protocols in Low-Level Software
- Heintze and Tardieu, Ultra-Fast Aliasing using CLA: A Million Lines of C Code in a Second
- Bush, Pincus, and Sielaff, A static analyzer for finding dynamic programming errors
- Choi, Lee, Loginov, O'Callahan, Sarkar, and Sridharan, Efficient and Precise Datarace Detection for Multithreaded Object-Oriented Programs
- Savage, Burrows, Nelson, Sobalvarro, and Anderson, Eraser: A Dyanmic Data Race Detector for Multi-Threaded Programs
- Flanagan and Freund, Type-Based Race Detection for Java
- Guyer and Lin, Client-Driven Pointer Analysis
- Hovemeyer and Pugh, Finding Bugs is Easy
- O'Callahan and Jackson, Lackwit: A Program Understanding Tool Based on Type Inference (see also Ajax)
- Evans, Static Detection of Dynamic Memory Errors
- Das, Lerner, and Seigle, ESP: Path-Sensitive Program Verification in Polynomial Time
- Polyspace, Abstract Inerpretation
- Flanagan, Leino, Lillibridge, Nelson, Saxe, and Stata, Extended Static Checking for Java
Front Ends
Any program analysis that works on a real language needs a parser for
that language. Fortunately, there are many existing front-ends for
popular languages. CQual has a robust C parser that you can use, and
ifr you want to work in O'Caml, CIL has proven very popular with 631
students in the past; Elkhound is a C++ parser; and Soot and BCEL are
systems for working with Java bytecodes.
