A Static Analyzer for Finding Dynamic Programming Errors William R. Bush, Jonathan D. Pincus, David J. Sielaff A Report - Nikhil Swamy, November 21, 2002 This paper describes the design of a program analysis tool that is intended to address some of the issues below. Objective: Provide a tool that addresses pragmatic defects in programs The tool does not necessarily have to be sound, only "useful" Useful means 1.require little change to the an existing code-base i.e no annotations etc. 2.minimize noise - irrelevant warnings 3.present defects in an easy-to-understand manner 4.and little change to the programming environment How does PREfix claim to achieve this ? The observation is made that most programming errors occur in an *inter*-proc. context. Any reasonably effective tool must perform an inter-procedural path sensitive analysis. The basic model for doing this is simulating the execution of a function on an abstract memory model implemented in a virtual machine. The state of the memory is tracked through every statement in every function as well as across arcs in the call-graph. By observing the state-transformation of this abstract memory a model for the behavior of the function is extracted. General Methodology: The first action performed by PREfix is to parse the source into a AST form. In this form the call-graph is naturally represented and caller-callee relationships are determined. The simulation of functions proceeds in a bottom-up manner - caller before callee. The result of performing a simulation of a function is a model of the function's behavior. So in a typical run prior to analyzing user functions pre-existing models for library functions are loaded. Simulation of a function proceeds at the level of individual statements. The execution of each statement is tracked by keeping record of changes the statement effects on the memory model. The model of memory is arbitrarily detailed - state of memory can be tracked down to the level of a single bit though in practice a more coarse grained model is used. The Memory Model: At each declaration of a variable, parameter or constant a new memory location is initialized. The size of the chunk modeled is dependent on the type of the object under consideration. Associated with each memory location is a set of predicates. These predicates qualify the possible values that may be contained in that location - for instance, the predicate uninitialized may be associated with an uninitialized local, or the predicate may contain a statement about the value contained in a location like x=5, x>0, DONTKNOW(x) etc. Assignments to variables result in a transformation of the predicates associated with its location according to some set of well-specified rules related to the language semantics. - e.g if x > 0, y < 0 then x := x*y results in the new predicate x < 0 associated with x's location Prior to usage in some expression semantic constraints are imposed on each variable in the expression (if required) and defects are generated if the associated predicates do not imply a satisfaction of the constraint. Branches, Multiple Paths and Assumptions: PREfix generates a model for a function by considering every execution-path through a function. This includes possible paths through functions that are called from the initial function. The obvious intractability of this approach is controlled by the use of parameters that limit the total number of execution paths that are analyzed. When faced with a conditional PREfix attempts to evaluate the test condition. If it is possible to evaluate the condition exactly (either the values of the variables in the expression are well-defined or at least restricted to some set that permits deciding the boolean value of the expression) then the appropriate branch of the conditional is simulated. If however, the value of the test-expression is indeterminate (DONTKNOW) then PREfix choses a path randomly from the set of alternatives. Once this choice is made then the predicates associated with each memory location related to the conditional-expr are updated to reflect the assumptions implicit in the choice. e.g void foo(int x) { //bottom-up simulation means param x is indeterminate if(x == 1) { //if this branch is chosen by the simulation then /*B1*/ //on this path we associate x=1 with param x } else { //if this branch is chosen by the simulation then /*B2*/ //on this path we associate x!=1 with param x } } These assumptions are associated with the entire path and allow provide PREfix to provide detailed reports of defects that include the specific conditions under which the defect may occur. Function Models: The simulation of a particular path through a function continues until the end of the function is reached. When a path simulation is complete the assumptions that were made on the path are recorded together with the state of memory (side effects) and the return value if one exists. PREfix then proceeds to analyze alternate paths of execution through the function - choses otherbranches at each of the choice (assumption) points. Once the path-limit is reached for a function (this is heuristically determined reliant on the total path limit) the results of each of the execution paths is aggregated into a model for the function. A model has three components for each path simulated - 1. Constraints : These are hard constraints on certain properties that must be fulfilled to succesfully call this function. These constraints are typically imposed by the semantics of the language. If a constraint is not satisfied by some caller then a defect is generated. e.g (constraint pne P NULL) - the pointer P cannot be NULL 2. Result : This represents the result of the function and includes both return values as well as side effects. e.g (result new_memory size) - function allocates *size* bytes of memory (result return ptr pne NULL) - returns a not NULL pointer 3. Guards : This component of a model represents the assumptions that were made on this path. As such the guards serve to differentiate the multiple paths through the function. The only paths through a function that are possible are those paths that have guards that are satisfied in the context of the caller. A guard can be thought of as a weak constraint. e.g (guard eq x 1) In the branching example above this guard would be associated with the path B1. PREfix uses some heuristics to merge multiple paths etc. to make the representation of the model for a function more compact. The model represents at an parametrized level of accuracy (granularity) the complete behavior of a function with respect to the abstract memory model. Inter-Procedural Simulation: Since models for the callees are generated prior to the simulation of the caller functions PREfix is able to abstract the behavior of the caller by using transforming the memory state in the caller's context by applying the model of the callee. Choice points and assumptions that were related earlier in the case of simple conditionals extend analogously to multiple paths through callee functions. The guards of the chosen path in the model of the callee becomes an assumption in the callers context. Conclusion Some of the nomenclature used in describing PREfix is somewhat ambiguous. While it is claimed that a static method is adopted the detailed simulation of the execution of the program has a distinctly dynamic flavor. This much is acknowledged by the authors although this is done in the context of existentially quantified analysis. The results documented in the paper are mostly related to apache, a 40K LOC application. The claim is made that the tool scales well to 5M+ LOC applications and though results are noted for mozilla (0.5M LOC) this is still an order of magnitude off of the claim. Much emphasis is placed on the presentation of defects to the user, in particular to filter out noisy output. Upto 25% of the reported defects for Mozilla were classified as noise. While the post-analysis filtering approach is pragmatic it does not reflect too favorably on the quality of the analysis. PREfix does not claim to provide a sound analysis of a program. The authors claim however that in practice the defects reported by the tool are relevant real errors. An examination of the sample output of the tool certainly leads one to believe that detailed informative descriptions of defects are generated. It is especially attractive in that no extra annotations or other instrumentation of the source is necessary to run the analysis. I for one would certainly like to at least try using PREfix. Additional Resources: These are links to some presentations made by J.Pincus @ MS Research. They emphasize the attention paid to the user interaction component of the PREfix tool - apparently less than 10% of the PREfix codebase is related to the analyzer engine. These also lay out the high-level motivation for design decisions made in the development of PREfix. research.microsoft.com/specncheck/docs/pincus.ppt research.microsoft.com/si/PPT/InfrastructureForCorrectnessTools.ppt www.cc.gatech.edu/~harrold/issta00/cfp/slides/pincus.issta2000.ppt