Yi Qing www.cs.rice.edu/~qingyi Transforming Loop Structures Over the past 20 years, the speed of processors in computer systems has increased much faster than the speed of memories. To achieve high performance, applications must exploit enough locality so that data can be reused before being evicted from the cache memories between the processor and main memory. Many compiler techniques have been developed to automatically transform applications to enhance locality. These techniques are both effective and efficient when the loop structures of the applications are simple, preferably perfectly nested loops. However, on more complicated loop structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. This talk introduces techniques to transform arbitrary loop structures both effectively and efficiently. Building on traditional {\em unimodular} transformation techniques on perfectly nested loops, we have developed a technique called {\em computation slicing} to re-arrange the loop structure of arbitrary nested loops. First, we adopt a loop transformation model that is not limited by the original nesting structure of the program and extend traditional dependence representations to model dependence relations between iterations of arbitrary loops. We then apply transitive analysis on the dependence graph and compute complete transitive dependence information between these loops. These transitive dependence edges are used to decide the feasibility of a set of loop transformations independent of their original nesting structure. The original program is then transformed directly according to the accumulated information. The complexity of a slicing transformation is $N^2 + N*L*D$, where $N$ is the number of statements, $L$ is the maximum depth of loop nests, $D$ is the size of the dependence graph. This complexity is comparable to that of standard unimodular loop transformations on perfectly nested loops. We also developed a transformation framework which systematically applies computation slicing to block arbitrary loop nests for better locality. The complexity of this framework is $N^2*L + N*L^2*D$. We believe this framework is efficient enough to be used in most commercial compilers. We have implemented the framework as a source-to-source translator and applied it to successfully transform four numerical benchmark kernels: Cholesky, QR, LU factorization without pivoting and LU with partial pivoting. We choose these benchmarks because they are considered very hard to block automatically. To our knowledge, no previous compiler implementation has completely automated the blocking of QR and LU with pivoting. Our implementation has successfully blocked all these benchmarks and achieved performance improvement comparable to those code manually blocked by experienced algorithm designers~\cite{LAPACK3}. This implies that this framework is in practice comparable in power and effectiveness to most previous general transformation frameworks.