Research Statement
Michael D. Beynon

      My research interests are in the area of parallel and distributed systems. Real applications are the driving force behind my work. I prefer to study important emerging applications that are not easy to run efficiently. I experimentally identify performance problems with their execution, derive solutions to these problems, and try to generalize these solutions to other applications in the same class. This provides insight to abstract away application details and produce general techniques which are forward looking, but still rooted in real applications. The following sections list some potential research problems I would like to pursue. Note that Lawrence Livermore National Labs are currently funding the work covered by my thesis at Maryland, and are planning to increase the funding level for three years, which demonstrates usefulness and promise.

      There are many emerging applications that would benefit greatly if efficient data-intensive applications were easier to develop, especially in a distributed environment. In my dissertation, I developed a framework that allows data-intensive applications to run more efficiently in a wide-area environment, which still leaves some of the design burden on the application designer. Ideally, we would like to write an application in any programming language, and have a combination of compiler and runtime system decide how and where to execute it. I call this Simple Data-Intensive Computing, where the most important goal is simplicity. There are many unsolved research and engineering problems that would help make this overall vision possible, which is shared by many leading researchers in the community.

Compiler / Programming Model

      In my thesis work, applications are manually decomposed into appropriate sized components by the programmer. Unfortunately, the particular decomposition used by the programmer may not be an efficient choice for its eventual runtime environment. The choice of granularity to use for the size and scope of components is a very hard problem. Complicating factors include unknown runtime hosts, network characteristics, data-source locations, etc., implying the decomposition process may need to be revisited at execution time. Existing research into parallelizing compilers could be utilized to help decide how to break up the original code into components, but does not address the dependency on runtime information needed in the decomposition process. One approach is to defer part of the compilation until runtime. At compile time, the application code is divided into small unit size pieces, still in high-level form. The actual decomposition is deferred until runtime when a custom decomposition in created by repeated merging of the unit size pieces according to some goodness metric. These specialized components are then compiled into native code for each target architecture as needed. This two-phase compilation also avoids the overhead of using a virtual machine (as in Java's VM) at runtime to provide platform independence. Additional work is needed to design a scalable service to provide this runtime compilation, distribution, and caching of specialized binaries for execution.

      Currently, compilers produce object code for a particular platform, and the high level control and data flow information is generally discarded. In my thesis, I relied on the application developer to provide extra information about their code. This extra information should come from the compiler. Useful information includes a semantic description of the data flow within a component, some notion of the complexity or runtime cost of execution, memory and other resource usage, etc. This information is necessary during runtime scheduling because placement of components involves a limited form of performance prediction to compare feasible locations.

Operating System Design

      Middleware typically sits atop operating systems, to provide services that span individual hosts. Much of the implementation in my thesis work was dictated by available operating system support at the various distributed locations. There is an interesting research problem into how the design of an operating system would change if we could assume data-intensive applications were using a component model similar to that in my thesis, rather than general purpose code. Combining this programming model knowledge with extra information from the compiler could allow the OS itself to perform much of work now delegated to the middleware layer. Improved page replacement policies, scheduling, etc., could all significantly improve performance. Lesser in scope than potentially sweeping OS design changes, I am also interested in identifying useful new OS abstractions that would provide middleware layers the information and hooks they need to achieve similar results.

Data Discovery and Caching

      Any distributed application requires some notion of location independence to be able to operate. Similarly, data-intensive applications should not manually specify where their large input data-sets are located. Research in distributed filesystems exploit this decoupling to provide useful features such as automatic replica selection, transparent caching of I/O requests near clients, and request for data by property or description instead of filename. The usefulness of these features is tempered by unknown data-set location and non-uniform I/O times due to caching, which inhibits effective application decomposition and runtime scheduling. In my thesis, the idea was to push application components toward the data-set. The best approach is likely to combine these two approaches by developing a hybrid that both pulls and caches data, while also moving the application components to good locations. For example, consider the scenario where an application component needs data from more than one data-source. Previously, the component could be located at one of the data-sources, or at a third location; none of which is usually a good choice if the data-sources are remote from each other. The hybrid approach would allow a cache to be populated at one of the locations, which would alleviate some of the latency penalty when there is no clear placement choice.

      When considering I/O requests for raw data, the caching problem is well-defined, because disk resident data has a unique name (file-id, offset, length) that can be used for cache lookups. If instead we try to cache the intermediate data products that exist between two components, we find it is not clear how to identify, describe, name, and match entities with future accesses. An access is likely embedded in fields within an application level protocol, and application logic decides how this maps into the intermediate data, which makes automatic caching difficult. There is no clear disk block request and result, as is found in I/O caching. Work is needed to decide what application level knowledge must be exported to enable caching of intermediate results for a single application or concurrent applications. An alternate approach is to instead provide access to a caching tool library, which the components can use if they so choose in cases where it makes sense. This approach follows the end-to-end argument, and appears most likely to work in practice. The potential benefits would be large in terms of efficiency, but this also creates other problems such as security concerns when dealing with multiple applications.

Ease of Debugging

      Significant work is needed to assist with debugging distributed applications. Current methods are ad-hoc and there is no systematic framework that is consistent and general. Work is needed to effectively provide debugging mechanisms for distributed applications written using a component architecture. A clear and reasonable approach would be to design the debugging service using components that have privileged access to other running applications.

Non-distributed Situations

      The techniques described thus far are not necessarily limited to distributed applications. For example, consider an application that processes data in a data-flow fashion, and the runtime host is a shared-memory multi-processor. For many applications, it may be possible to create additional threads to leverage the application structure and make use of the additional processor(s), thus increasing performance. Instead of forcing an application to be manually rewritten to allow this internal parallel processing of data, we can use the above techniques to specialize the code for this runtime environment which could include multiple data-flow pipelines of components that are tuned to the number of processors and load by using the extra compiler information. Therefore, use of component techniques beyond distributed applications is another promising research area.