You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format. However, this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the Department of Computer Science of the University of Maryland at College Park under terms that include this permission. All other rights are reserved by the author(s).
Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example, they can be used to precisely describe which iterations of a statement are data dependent on which other iterations. It is generally not possible to represent these tuple relations by enumerating the related pairs of tuples. For example, it is impossible to enumerate the related pairs of tuples in the relation {[i] -> [i+2] | 1 <= i <= n-2}, because n is a symbolic parameter. Even when it is possible to enumerate the related pairs of tuples, such as for the relation {[i,j] -> [i',j'] | 1 <= i,j,i',j' <= 100}, it is often not practical to do so. We instead use a closed form description by specifying a predicate consisting of affine constraints on the related pairs of tuples. As we just saw, these affine constraints can be parameterized, so what we are really describing are infinite families of relations (or graphs). Many of our applications of tuple relations rely heavily on an operation called transitive closure. Computing the transitive closure of these "infinite graphs" is very different from the traditional problem of computing the transitive closure of a graph whose edges can be enumerated. For example, the transitive closure of the first relation above is the relation {[i] -> [i'] | exists beta s.t. i'-i = 2beta and 1 <= i <= i' <= n}. As we will prove, transitive closure is not computable in the general case. We have developed algorithms that produce exact results in most commonly occurring cases and produce upper or lower bounds (as necessary) in the other cases. This paper will describe our algorithms for computing transitive closure and some of its applications, such as determining which inter-processor synchronizations are redundant.
(Also cross-referenced as UMIACS-TR-95-48)
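The transitive closure example in the abstract above can be checked by brute force once n is fixed to a concrete value. The following is only a sketch for illustration, not the report's symbolic algorithm: the function names are hypothetical, and it assumes the printed closed form is meant as the reflexive closure over the points 1..n, since the form admits i = i'.

```python
def relation(n):
    """Pairs of {[i] -> [i+2] | 1 <= i <= n-2} for one concrete n."""
    return {(i, i + 2) for i in range(1, n - 1)}


def reflexive_transitive_closure(pairs, n):
    """Brute-force closure over the points 1..n.

    Assumption: identity pairs are included, to match a closed form
    that allows i = i'.
    """
    closure = set(pairs) | {(i, i) for i in range(1, n + 1)}
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def closed_form(n):
    """Pairs with i' - i = 2*beta for some integer beta and 1 <= i <= i' <= n."""
    return {(i, j)
            for i in range(1, n + 1)
            for j in range(i, n + 1)
            if (j - i) % 2 == 0}


if __name__ == "__main__":
    for n in range(1, 10):
        assert reflexive_transitive_closure(relation(n), n) == closed_form(n)
```

Enumeration of this kind only works for a fixed n. The point of the closed form is that it describes the closure for every value of n at once.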
There has been a great amount of recent work toward unifying iteration reordering transformations. Many of these approaches represent transformations as affine mappings from the original iteration space to a new iteration space. These approaches show a great deal of promise, but they all rely on the ability to generate code that iterates over the points in these new iteration spaces in the appropriate order. This problem has been fairly well-studied in the case where all statements use the same mapping. We have developed an algorithm for the less well-studied case where each statement uses a potentially different mapping. Unlike many other approaches, our algorithm can also generate code from mappings corresponding to loop blocking. We address the important trade-off between reducing control overhead and duplicating code.
(Also cross-referenced as UMIACS-TR-94-87.1)
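The order that generated code must respect when each statement has its own mapping can be illustrated by enumeration. The sketch below is not the report's code generation algorithm, which produces loops symbolically; it only lists statement instances in lexicographic order of their mapped points for concrete bounds. The function name, the statement names and the two mappings are hypothetical.

```python
def execution_order(statements):
    """Return statement instances in lexicographic order of their mapped points.

    statements: list of (name, iteration_points, mapping), where mapping
    takes an iteration point and returns a tuple in the new iteration space.
    All mappings are assumed to return tuples of the same length.
    """
    instances = []
    for name, points, mapping in statements:
        for p in points:
            instances.append((mapping(p), name, p))
    # The sort is stable, so instances mapped to the same point keep
    # the textual order of their statements.
    instances.sort(key=lambda t: t[0])
    return [(name, p) for _, name, p in instances]


if __name__ == "__main__":
    statements = [
        ("s1", range(1, 4), lambda i: (i, 0)),      # s1 keeps its time step
        ("s2", range(1, 4), lambda i: (i + 1, 1)),  # s2 is shifted by one step
    ]
    for name, i in execution_order(statements):
        print(name, i)
```

A code generator has to emit loops that visit the instances in this order without enumerating them, which is where the trade-off between control overhead and code duplication arises.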
As recent studies show, state-of-the-art parallelizing compilers produce no noticeable speedup for 9 out of 12 PERFECT benchmark codes, while the speedup that was reached by manually applying certain automatable techniques ranges from 10 to 50. In this paper we introduce the Global Value Propagation algorithm that unifies several of these techniques.
Global propagation is performed using a program abstraction called the Value Flow Graph (VFG). The VFG is an acyclic graph in which vertices and arcs are parametrically specified using F-relations. The distinctive features of our propagation algorithm are: (1) It propagates not only values carried by scalar variables, but also values carried by individual array elements. (2) We do not have to transform a program in order to use propagation results in program analysis.
In this paper we focus on the use of the VFG and global value propagation in array dataflow analysis. F-relations are used to represent values produced by uninterpreted function symbols that appear in dependence problems for non-affine program fragments. Global value propagation helps us to discover that some of these functions are in fact affine.
(Also cross-referenced as UMIACS-TR-94-80)
Traditionally, optimizing compilers attempt to improve the performance of programs by applying source-to-source transformations, such as loop interchange, loop skewing and loop distribution. Each of these transformations has its own special legality checks and transformation rules, which make it hard to analyze or predict the effects of compositions of these transformations. To overcome these problems we have developed a framework for unifying iteration reordering transformations. The framework is based on the idea that all reordering transformations can be represented as a mapping from the original iteration space to a new iteration space. The framework is designed to provide a uniform way to represent and reason about transformations. An optimizing compiler would use our framework by finding a mapping that both corresponds to a legal transformation and produces efficient code. We present the mapping selection problem as a search problem by decomposing it into a sequence of smaller choices. We then characterize the set of all legal mappings by defining an implicit search tree.
(Also cross-referenced as UMIACS-TR-94-71)
Keywords: Automatic Parallelization, Compilation, Optimization, Array Data Dependence Analysis, Presburger Arithmetic, Omega test, Dependence Relation
This paper will appear in ACM TOPLAS
Existing compilers often fail to parallelize sequential code, even when a program can be manually transformed into parallel form by a sequence of well-understood transformations (as is the case for many of the Perfect Club Benchmark programs). These failures can occur for several reasons: the code transformations implemented in the compiler may not be sufficient to produce parallel code, the compiler may not find the proper sequence of transformations, or the compiler may not be able to prove that one of the necessary transformations is legal.
When a compiler cannot extract sufficient parallelism from a program, the programmer may try to extract additional parallelism. Unfortunately, the programmer is typically left to search for parallelism without significant assistance. The compiler generally does not give feedback about which parts of the program might contain additional parallelism, or about the types of transformations that might be needed to realize this parallelism. Standard program transformations and dependence abstractions cannot be used to provide this feedback.
In this paper, we propose a two-step approach to the search for parallelism in sequential programs. We first construct several sets of constraints that describe, for each statement, which iterations of that statement can be executed concurrently. By constructing constraints that correspond to different assumptions about which dependences might be eliminated through additional analysis, transformations and user assertions, we can determine whether we can expose parallelism by eliminating dependences. In the second step of our search for parallelism, we examine these constraint sets to identify the kinds of transformations that are needed to exploit scalable parallelism. Our tests will identify conditional parallelism and parallelism that can be exposed by combinations of transformations that reorder the iteration space (such as loop interchange and loop peeling).
This approach lets us distinguish inherently sequential code from code that contains unexploited parallelism. It also produces information about the kinds of transformations that will be needed to parallelize the code, without worrying about the order of application of the transformations. Furthermore, when our dependence test is inexact, we can identify which unresolved dependences inhibit parallelism by comparing the effects of assuming dependence or independence. We are currently exploring the use of this information in programmer-assisted parallelization.
(Also cross-referenced as UMIACS-TR-94-40)
This paper will appear in the Proceedings of the 1994 ACM SIGPLAN Conference on Programming Language Design and Implementation
We describe methods that are able to count the number of integer solutions to selected free variables of a Presburger formula, or sum a polynomial over all integer solutions of selected free variables of a Presburger formula. This answer is given symbolically, in terms of symbolic constants (the remaining free variables in the Presburger formula).
For example, we can create a Presburger formula whose solutions correspond to the iterations of a loop. By counting these, we obtain an estimate of the execution time of the loop.
In more complicated applications, we can create Presburger formulas whose solutions correspond to the distinct memory locations or cache lines touched by a loop, the flops executed by a loop, or the array elements that need to be communicated at a particular point in a distributed computation. By counting the number of solutions, we can evaluate the computation/memory balance of a computation, determine whether a loop is load balanced, evaluate message traffic, and allocate message buffers.
(Also cross-referenced as UMIACS-TR-94-27)
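The loop-counting example in the abstract above can be made concrete with a triangular loop nest. The sketch below is an illustration under stated assumptions, not the report's method: the loop nest, the function names and the closed forms are chosen here, and the closed forms are checked against brute-force enumeration. Note that the symbolic answer needs a guard on the symbolic constant n.

```python
def count_iterations(n):
    """Count integer solutions of 1 <= i <= n and i <= j <= n by enumeration."""
    return sum(1 for i in range(1, n + 1) for j in range(i, n + 1))


def symbolic_count(n):
    """Closed form in terms of n; the guard n >= 1 is part of the answer."""
    return n * (n + 1) // 2 if n >= 1 else 0


def sum_of_i(n):
    """Sum the polynomial i over the same solutions by enumeration."""
    return sum(i for i in range(1, n + 1) for j in range(i, n + 1))


def symbolic_sum_of_i(n):
    """Closed form for the sum of i over the same solutions, guarded in the same way."""
    return n * (n + 1) * (n + 2) // 6 if n >= 1 else 0


if __name__ == "__main__":
    for n in range(-3, 20):
        assert count_iterations(n) == symbolic_count(n)
        assert sum_of_i(n) == symbolic_sum_of_i(n)
```

Without the guard, the unguarded formula n(n+1)/2 would report one iteration for n = -2, although the loop nest does not execute at all.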
This paper appears in the Proceedings of the Sixth Annual Workshop on Programming Languages and Compilers for Parallel Computing
Standard array data dependence testing algorithms give information about the aliasing of array references. If statement 1 writes a[5], and statement 2 later reads a[5], standard techniques describe this as a flow dependence, even if there is an intervening write. We call a dependence between two references to the same memory location a memory-based dependence. In contrast, if there are no intervening writes, the references touch the same value and we call the dependence a value-based dependence.
There has been a surge of recent work on value-based array data dependence analysis (also referred to as computation of array data-flow dependence information). In this paper, we describe a technique that is exact over programs without control flow (other than loops) and non-linear references. We compare our proposal with the technique proposed by Paul Feautrier, which is the other technique that is complete over the same domain as ours. We also compare our work with that of Tu and Padua, a representative approximate scheme for array privatization.
(Also cross-referenced as UMIACS-TR-93-137)
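The distinction between memory-based and value-based dependences drawn in the abstract above can be shown on an explicit execution trace. This is a minimal sketch, assuming a hypothetical trace format of (statement, kind, location) triples; the techniques in the report work symbolically on loops rather than on traces.

```python
def flow_dependences(trace):
    """Compute flow dependences from an execution trace.

    trace: list of (statement, kind, location) in execution order,
    with kind either "write" or "read".
    Returns (memory_based, value_based), each a set of (writer, reader) pairs.
    """
    memory_based, value_based = set(), set()
    writers = {}  # location -> statements that have written it so far
    for stmt, kind, loc in trace:
        if kind == "write":
            writers.setdefault(loc, []).append(stmt)
        else:
            earlier = writers.get(loc, [])
            # Every earlier write to the same location is a memory-based dependence.
            memory_based.update((w, stmt) for w in earlier)
            # Only the most recent write supplies the value that is read.
            if earlier:
                value_based.add((earlier[-1], stmt))
    return memory_based, value_based


if __name__ == "__main__":
    trace = [("s1", "write", "a[5]"),
             ("s2", "write", "a[5]"),
             ("s3", "read", "a[5]")]
    memory_based, value_based = flow_dependences(trace)
    print(sorted(memory_based))
    print(sorted(value_based))
```

In this trace the write in s2 intervenes between s1 and s3, so the pair (s1, s3) is a memory-based dependence but not a value-based one.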
We present a framework for unifying iteration reordering transformations such as loop interchange, loop distribution, skewing, tiling, index set splitting and statement reordering. The framework is based on the idea that a transformation can be represented as a schedule that maps the original iteration space to a new iteration space. The framework is designed to provide a uniform way to represent and reason about transformations. As part of the framework, we provide algorithms to assist in the building and use of schedules. In particular, we provide algorithms to test the legality of schedules, to align schedules and to generate optimized code for schedules.
(Also cross-referenced as UMIACS-TR-93-134)
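The legality test mentioned in the abstract above can be illustrated for a single statement and concrete bounds: a schedule is legal if it maps the source of every dependence to an earlier point than the sink. The sketch below is an assumption-laden illustration, not the report's algorithm, which tests legality symbolically. The example loop nest and the function names are hypothetical.

```python
def dependences(n):
    """Dependence pairs (source, sink) for the loop nest

        for i = 1..n: for j = 1..n: a[i][j] = a[i-1][j+1]

    Iteration (i, j) writes the element that iteration (i+1, j-1) reads.
    """
    return [((i, j), (i + 1, j - 1))
            for i in range(1, n)
            for j in range(2, n + 1)]


def is_legal(schedule, deps):
    """A schedule is legal if every source is mapped lexicographically before its sink."""
    # Python compares tuples lexicographically, which is the order needed here.
    return all(schedule(src) < schedule(dst) for src, dst in deps)


if __name__ == "__main__":
    deps = dependences(4)
    identity = lambda p: p
    interchange = lambda p: (p[1], p[0])
    skew_then_interchange = lambda p: (p[0] + p[1], p[0])
    print(is_legal(identity, deps))               # original order
    print(is_legal(interchange, deps))            # plain loop interchange
    print(is_legal(skew_then_interchange, deps))  # skewing followed by interchange
```

Plain interchange reverses this dependence, while skewing first makes the interchanged order legal. Representing both as schedules lets one test cover the composition.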
Data dependence distance is widely used to characterize data dependences in advanced optimizing compilers. The standard definition of dependence distance assumes that loops are normalized (have constant lower bounds and a step of 1); there is not a commonly accepted definition for unnormalized loops. We have identified several potential definitions, all of which give the same answer for normalized loops. There are a number of subtleties involved in choosing between these definitions, and no one definition is suitable for all applications.
(Also cross-referenced as UMIACS-TR-93-133)
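The ambiguity described in the abstract above shows up as soon as a loop has a step other than 1. The sketch below compares two candidate definitions chosen here for illustration (difference in loop variable values, and difference in iteration counts); they are not necessarily the definitions the report identifies. The loop shape and the function name are hypothetical, and a positive step is assumed.

```python
def distances(lower, upper, step, offset):
    """Distances for the loop

        for i = lower..upper by step: a[i + offset] = a[i] + 1

    Returns (value_diffs, count_diffs): the dependence distance measured
    in loop variable values and in iteration counts.
    Assumption: step is positive.
    """
    index_values = list(range(lower, upper + 1, step))
    position = {i: k for k, i in enumerate(index_values)}
    value_diffs, count_diffs = set(), set()
    for i in index_values:
        j = i + offset  # the iteration that reads what iteration i wrote
        if j in position:
            value_diffs.add(j - i)
            count_diffs.add(position[j] - position[i])
    return value_diffs, count_diffs


if __name__ == "__main__":
    print(distances(1, 9, 1, 2))  # normalized loop: both definitions agree
    print(distances(1, 9, 2, 2))  # step 2: the two definitions differ
```

For the normalized loop both definitions give a distance of 2. With a step of 2, the value-based definition still gives 2 while the count-based one gives 1.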
Array data dependence analysis methods currently in use generate false dependences that can prevent useful program transformations. These false dependences arise because the questions asked are conservative approximations to the questions we really should be asking. Unfortunately, the questions we really should be asking go beyond integer programming and require decision procedures for a subclass of Presburger formulas. In this paper, we describe how to extend the Omega test so that it can answer these queries and allow us to eliminate these false data dependences. We have implemented the techniques described here and believe they are suitable for use in production compilers. (An earlier version of this paper appeared at the ACM SIGPLAN PLDI'92 conference.)
(Also cross-referenced as UMIACS-TR-93-132)
Automatic parallelization of real FORTRAN programs does not yet live up to users' expectations, and dependence analysis algorithms that either produce too many false dependences or are too slow contribute significantly to this. In this paper we introduce a data-flow dependence analysis algorithm that exactly computes value-based dependence relations for program fragments in which all subscripts, loop bounds and IF conditions are affine. Our algorithm also computes good affine approximations of dependence relations for non-affine program fragments. We do not know of any other algorithm that computes better approximations.
Our algorithm is also efficient, because it is lazy. When searching for write statements that supply values used by a given read statement, it starts with statements that are lexicographically close to the read statement in iteration space. Then, if some of the read statement instances are not "satisfied" with these close writes, the algorithm broadens its search scope by looking into more distant writes. The search scope keeps broadening until all read instances are satisfied or no write candidates are left.
We timed our algorithm on several benchmark programs, and the timing results suggest that our algorithm is fast enough to be used in commercial compilers: it usually takes 5 to 15 percent of f77 -O2 compilation time to analyze a program. Most programs in the 100-line range take less than 1 second to analyze on a SUN SparcStation IPX.
(Also cross-referenced as UMIACS-TR-93-69)
This is a revised version of the CS-TR-3109 report that appeared in July 1993.
Why do existing parallelizing compilers and environments fail to parallelize many realistic FORTRAN programs? One of the reasons is that these programs contain a number of linearized array references, such as A(M*N*i+N*j+k) or A(i*(i+1)/2+j). Performing exact dependence analysis for these references requires testing polynomial constraints for integer solutions. Most existing dependence analysis systems, however, restrict themselves to solving affine constraints only, so they have to make worst-case assumptions whenever they encounter a polynomial constraint.
In this paper we introduce an algorithm that exactly and efficiently solves a class of polynomial constraints that arise in dependence testing. Another important application of our algorithm is to generate code for the loop transformation known as symbolic blocking (tiling).
(Also cross-referenced as UMIACS-TR-93-68.1)
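The dependence question behind a linearized reference such as A(M*N*i+N*j+k) can be illustrated by brute force for concrete sizes. The sketch below is not the report's algorithm, which handles symbolic M and N. It assumes zero-based index ranges 0 <= i < L, 0 <= j < M and 0 <= k < N, which the abstract does not state; the function name and the k_extent parameter are hypothetical.

```python
from itertools import product


def collisions(L, M, N, k_extent=None):
    """Find distinct index triples that map to the same linearized address.

    Assumed ranges: 0 <= i < L, 0 <= j < M, 0 <= k < k_extent,
    where k_extent defaults to N.
    """
    if k_extent is None:
        k_extent = N
    seen = {}     # address -> first index triple that produced it
    clashes = []
    for i, j, k in product(range(L), range(M), range(k_extent)):
        address = M * N * i + N * j + k
        if address in seen:
            clashes.append((seen[address], (i, j, k)))
        else:
            seen[address] = (i, j, k)
    return clashes


if __name__ == "__main__":
    print(collisions(3, 4, 5))               # indices within bounds
    print(collisions(3, 4, 5, k_extent=6))   # k allowed to overrun its dimension
```

With every index inside its assumed range, no two iterations touch the same element, so there is no dependence between different iterations. A system limited to affine constraints cannot establish this when M and N are symbolic, because M*N*i is a polynomial term.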
In previous work, we presented a framework for unifying iteration reordering transformations such as loop interchange, loop distribution, loop skewing and statement reordering. The framework provides a uniform way to represent and reason about transformations. However, it does not provide a way to decide which transformation(s) should be applied to a given program.
This paper describes a way to make such decisions within the context of the framework. The framework is based on the idea that a transformation can be represented as a schedule that maps the original iteration space to a new iteration space. We show how we can estimate the performance of a program by considering only the schedule from which it was produced.
We also show how to produce an upper bound on performance given only a partially specified schedule. Our ability to estimate performance directly from schedules and to do so even for partially specified schedules allows us to efficiently find schedules which will produce good code.
(Also cross-referenced as UMIACS-TR-93-67)