Basic approaches

As you can see in Table 2, the constant pool entries often make up most of the size of a classfile. In fact, the Utf8 entries alone often make up most of size of a classfile. Simply sharing Utf8 entries across classfiles leads to substantial reduction.

Table 2: Classfile breakdown

	Uncompressed
	Size (Kbytes)
Component	swingall	javac
Total size	3,265	516
excluding jar overhead	3,010	485
Field definitions	36	7
Method definitions	97	10
Code	768	114
other	72	12
constant pool	2,037	342
Utf8 entries	1,704	295
if shared	371	56
if shared & factored	235	26

But now we have to face another issue: encoding of references to constant pool entries. In most classfiles, the number of constant pool entries is relatively small. While the standard classfile format usually allots 2 bytes to encode a reference to a constant pool entry, we can often do so in a single byte. If we compress single byte control pool references, we are likely to get good compression. But if we pool constant pool entries, it is unlikely that we will be able to encode most constant pool entries in a single byte.

Numerous data compression algorithms have been developed. Many of the lossless compression algorithms in wide use were originally designed as text compression algorithms, and work on a stream of bytes. In particular, the Lempel-Ziv family of compression algorithms have a very strong byte orientation. While it might be possible to adapt them to compress a stream of larger values (e.g., 16-bit values), it isn't clear how efficient they would be. At any rate, efficient implementations of the byte-oriented zlib library exist on most platforms and is part of the standard Java API, so utilizing the existing library makes sense.

A first solution is to use different numbering for different kinds of constant pool entries (e.g., we can have Class 17, and MethodRef 17, and IntegerConstant 17). In almost all¹ contexts, we know the type of the constant pool entry whenever we reference it, so this won't cause confusion.

This helps some and might be sufficient for small archives. However, on large archives these techniques will not be sufficient to allow us to encode most references in a single byte. In addition to encoding most references within a single byte, we would also like the encoding bytestream to have a very skewed distribution, so that it can be further compressed. Techniques for encoding references are discussed in more detail in Section 5.

Even sharing the Utf8 entries still results in a fair bit of redundancy. Each time a classname is encoded, the full package name is encoded (e.g., java.lang), and classnames appear in full text in the types of fields and methods. For example, the type of a method that takes one string as an argument and returns a string is encoded as (Ljava.lang.String;)Ljava.lang.String;. If we factor out this duplication, we get another substantial reduction in the space required for string constants. Note that this factoring amounts to a wholesale reorganization of the classfile; the reorganization is described in more detail in Section 4.

The savings in uncompressed size realized by eliminating redundancy often doesn't fully materialize in the size of a compressed archive. By eliminating redundancy, we have removed one of the elements the compressor was using to get better compression. While factoring and other techniques are useful, they are often not as effective as they seem at first.