
Compressing Sets of Strings

The zlib compression algorithm works very well on text, so we would expect it to work well on a list of strings, and it does. However, because strings make up a substantial portion of the information in a Java class file (even once we have factored out information like class names and package names), it is important to do as well as possible.

Our approach to handling strings is similar to that for objects in general. The first time a string is encountered, we encode a special index to indicate a value not seen before, and we write the Unicode string using the UTF encoding. Different categories of strings (e.g., string constants or method names) are put into separate streams. String lengths are written to a different stream from the Unicode characters, because mixing the two degrades compression. When a string is encountered again, we encode a reference to it using the scheme used for objects in general, as discussed in Section 5 (e.g., the index into a move-to-front queue or a fixed id).
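
As a rough illustration of the scheme just described, the following Java sketch encodes one category of strings into three separate streams: references, lengths, and characters. It is a simplified sketch and not the implementation used in this work. The class name StringCategoryEncoder, the choice of 0 as the "not seen before" index, and the use of fixed ids (rather than a move-to-front queue) are all assumptions made for the example, and standard UTF-8 stands in for the UTF encoding mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical encoder for a single category of strings (e.g., method names). */
class StringCategoryEncoder {
    // Assumed marker meaning "this string has not been seen before".
    static final int NEW_STRING = 0;

    // Three separate streams, so that each can be compressed on its own.
    final List<Integer> indices = new ArrayList<>();   // references and NEW_STRING markers
    final List<Integer> lengths = new ArrayList<>();   // byte length of each new string
    final ByteArrayOutputStream chars = new ByteArrayOutputStream(); // encoded characters

    // Fixed ids, assigned in order of first appearance, starting at 1.
    private final Map<String, Integer> ids = new HashMap<>();

    void encode(String s) {
        Integer id = ids.get(s);
        if (id != null) {
            // Seen before: emit only a reference to the earlier occurrence.
            indices.add(id);
            return;
        }
        // First occurrence: assign the next fixed id and emit the marker,
        // the length, and the characters, each to its own stream.
        ids.put(s, ids.size() + 1);
        indices.add(NEW_STRING);
        byte[] utf = s.getBytes(StandardCharsets.UTF_8);
        lengths.add(utf.length);
        chars.write(utf, 0, utf.length);
    }
}
```

In this sketch, one encoder instance would be created per category of strings, and each of its three streams would then be handed to the compressor separately. Encoding "toString", "equals", "toString" in that order yields the reference stream 0, 0, 1, the length stream 8, 6, and a character stream holding the bytes of "toStringequals".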

William Pugh