next up previous
Next: Structuring information Up: Compressing Java Class Files Previous: Methodologies and Baselines

Basic approaches

When considering techniques for compressing Java classfile archives, one of the first techniques that jumps to mind is reusing constant pool entries. Constant pool entries are the things that can be referenced within a classfile: examples include classes, methods, integers, doubles, and utf8 encodings of Unicode strings. Often, constant pool entries reference to other constant pool entries. For example, the constant pool entry for a method reference consists of a reference to (the constant pool entry for) the class containing the method, and a reference to the signature of the method.

As you can see in Table 2, the constant pool entries often make up most of the size of a classfile. In fact, the Utf8 entries alone often make up most of size of a classfile. Simply sharing Utf8 entries across classfiles leads to substantial reduction.

Table 2: Classfile breakdown
  Size (Kbytes)
Component swingall javac
Total size 3,265 516
excluding jar overhead 3,010 485
Field definitions 36 7
Method definitions 97 10
Code 768 114
other 72 12
constant pool 2,037 342
Utf8 entries 1,704 295
if shared 371 56
if shared & factored 235 26

But now we have to face another issue: encoding of references to constant pool entries. In most classfiles, the number of constant pool entries is relatively small. While the standard classfile format usually allots 2 bytes to encode a reference to a constant pool entry, we can often do so in a single byte. If we compress single byte control pool references, we are likely to get good compression. But if we pool constant pool entries, it is unlikely that we will be able to encode most constant pool entries in a single byte.

Numerous data compression algorithms have been developed. Many of the lossless compression algorithms in wide use were originally designed as text compression algorithms, and work on a stream of bytes. In particular, the Lempel-Ziv family of compression algorithms have a very strong byte orientation. While it might be possible to adapt them to compress a stream of larger values (e.g., 16-bit values), it isn't clear how efficient they would be. At any rate, efficient implementations of the byte-oriented zlib library exist on most platforms and is part of the standard Java API, so utilizing the existing library makes sense.

A first solution is to use different numbering for different kinds of constant pool entries (e.g., we can have Class 17, and MethodRef 17, and IntegerConstant 17). In almost all1 contexts, we know the type of the constant pool entry whenever we reference it, so this won't cause confusion.

This helps some and might be sufficient for small archives. However, on large archives these techniques will not be sufficient to allow us to encode most references in a single byte. In addition to encoding most references within a single byte, we would also like the encoding bytestream to have a very skewed distribution, so that it can be further compressed. Techniques for encoding references are discussed in more detail in Section 5.

Even sharing the Utf8 entries still results in a fair bit of redundancy. Each time a classname is encoded, the full package name is encoded (e.g., java.lang), and classnames appear in full text in the types of fields and methods. For example, the type of a method that takes one string as an argument and returns a string is encoded as (Ljava.lang.String;)Ljava.lang.String;. If we factor out this duplication, we get another substantial reduction in the space required for string constants. Note that this factoring amounts to a wholesale reorganization of the classfile; the reorganization is described in more detail in Section 4.

The savings in uncompressed size realized by eliminating redundancy often doesn't fully materialize in the size of a compressed archive. By eliminating redundancy, we have removed one of the elements the compressor was using to get better compression. While factoring and other techniques are useful, they are often not as effective as they seem at first.

next up previous
Next: Structuring information Up: Compressing Java Class Files Previous: Methodologies and Baselines
William Pugh