We have been assuming that for each kind of data, one particular encoding scheme is optimal. Of course, this isn't the case: different schemes will work better on different benchmarks. To achieve better compression, the compression stage could try several encoding methods for each kind of data and select the one that happens to work best. The encoded data would include a description of the encoding mechanism used for each data sequence, and would not be substantially harder to decode than if a fixed policy were used for each kind of data.
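The idea above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it tries two DEFLATE settings plus a stored fallback, keeps whichever output happens to be smallest, and prefixes it with a one-byte tag so the decoder knows which scheme was used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class BestOfEncoders {
    // Candidate encodings; STORED (no compression) is the fallback.
    static final int STORED = 0, DEFLATE_FAST = 1, DEFLATE_BEST = 2;

    static byte[] deflate(byte[] data, int level) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Deflater d = new Deflater(level);
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, d)) {
            dos.write(data);
        }
        d.end();
        return out.toByteArray();
    }

    // Returns the smallest candidate encoding of this data sequence,
    // prefixed with a tag byte identifying the winning scheme.
    static byte[] encodeBest(byte[] data) throws IOException {
        byte[] best = data;
        int bestTag = STORED;
        byte[] fast = deflate(data, Deflater.BEST_SPEED);
        if (fast.length < best.length) { best = fast; bestTag = DEFLATE_FAST; }
        byte[] tight = deflate(data, Deflater.BEST_COMPRESSION);
        if (tight.length < best.length) { best = tight; bestTag = DEFLATE_BEST; }
        byte[] tagged = new byte[best.length + 1];
        tagged[0] = (byte) bestTag;
        System.arraycopy(best, 0, tagged, 1, best.length);
        return tagged;
    }
}
```

The per-sequence tag costs one byte, so the scheme can never lose more than that relative to always picking the single best fixed policy.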
There are a number of other approaches that might give minor performance improvements. The only change I can think of that would likely give non-trivial improvements would be to assume a standard set of pre-loaded references to frequently used package names, classes, method references and so on. It isn't actually guaranteed that this would improve compression (pre-loaded references that were never used would degrade compression), but I expect it would help on small archives. This would also likely increase the size of the decompressor, so in situations where the decompressor is not pre-installed, there would be no net benefit.
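One concrete way to realize a standard set of pre-loaded references is a preset dictionary, which DEFLATE supports directly. The sketch below is hypothetical (the dictionary contents are made up for illustration; a real tool would derive them from a corpus of archives), but it shows the mechanics: both sides agree on the dictionary in advance, so common substrings in small inputs can be encoded as back-references from the first byte.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionary {
    // Made-up sample of strings that recur in class files; a real tool
    // would choose this dictionary empirically from a training corpus.
    static final byte[] DICTIONARY =
        "java/lang/Object java/lang/String java/io/ ()V (I)V <init> <clinit>"
            .getBytes();

    static byte[] compress(byte[] data) {
        Deflater def = new Deflater();
        def.setDictionary(DICTIONARY);   // pre-load common substrings
        def.setInput(data);
        def.finish();
        byte[] buf = new byte[data.length * 2 + 64];
        int n = def.deflate(buf);
        def.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    static byte[] decompress(byte[] compressed, int originalLength)
            throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[originalLength];
        int n = inf.inflate(out);
        // The stream records that a dictionary is required; supply the
        // same pre-agreed dictionary and resume.
        if (n == 0 && inf.needsDictionary()) {
            inf.setDictionary(DICTIONARY);
            inf.inflate(out);
        }
        inf.end();
        return out;
    }
}
```

Note the trade-off mentioned above: the dictionary ships with the decompressor, enlarging it, so the gain only matters when the decompressor is already installed.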
As a research tool, its goal is to achieve as much compression as possible. However, for a tool that might be widely distributed and reimplemented, it might be better to have a specification of the packed format that is simple and clear. It may be appropriate to simplify the format by, for example, dropping approximate stack state (§7.1).
I expect that an implementation will be available for download from http://www.cs.umd.edu/pugh by the date of the conference.