next up previous
Next: Encoding Integers Up: Compressing Java Class Files Previous: Structuring information


Compressing References

Table 3: Size (in bytes) of compressed references

Benchmark Simple Basic Freq Cache Basic Transients Use Context and Context
rt 503,522 480,535 398,303 337,201 301,704 299,159 293,451 291,052
swingall 172,372 159,869 136,241 117,254 110,370 109,211 107,247 106,223
tools 94,293 85,547 71,396 64,417 57,207 56,778 55,408 54,998
icebrowserbean 16,935 14,907 12,945 11,616 10,596 10,550 10,260 10,233
jmark20 18,041 14,497 12,583 9,897 9,879 9,954 9,622 9,658
visaj 124,297 116,316 99,216 84,854 76,585 76,080 74,800 74,400
ImageEditor 25,669 23,473 19,886 16,871 15,834 15,750 15,361 15,323
Hanoi 5,953 4,704 4,245 3,824 3,788 3,794 3,648 3,650
Hanoi_big 3,866 2,973 2,617 2,370 2,316 2,318 2,243 2,242
Hanoi_jax 3,078 2,376 2,112 1,883 1,852 1,874 1,814 1,832
javafig_dashO 22,727 19,963 17,768 16,870 15,954 15,891 15,450 15,380
javafig 27,897 23,285 20,596 19,573 18,199 18,079 17,630 17,481
201_compress 757 516 506 497 461 477 456 470
202_jess 10,032 8,256 6,831 6,347 6,224 6,176 5,969 5,876
205_raytrace 2,603 1,966 1,812 1,762 1,646 1,671 1,550 1,576
209_db 843 575 489 483 466 476 455 467
213_javac 22,338 17,815 15,109 14,325 14,193 14,041 13,622 13,504
222_mpegaudio 4,568 3,440 3,143 2,917 2,706 2,708 2,644 2,674
228_jack 6,025 4,559 4,077 3,993 3,723 3,747 3,521 3,542

 Given a structure for the data we which to encode (Section 4), we need a way of encoding a reference to an object we may have seen before. For primitives (ints, doubles, ...), I just encode the value of the object, without bothering to check if I have seen the object previously.

Otherwise, we need an encoding that either says we have never seen the object before, or identifies the previously seen object. If we have never seen the object before, then at that point we encode all of the fields/components of the object.

I consider a number of approaches to encoding references. The basic approach that worked best was to use a move-to-front encoding. In a move-to-front encoding, we maintain an ordered list of all of the objects seen. Whenever a previously seen object is to be transmitted, we transmit the position of the object in the list (1 for the first object in the list) and move the object to the front of the list. To transmit an object not seen previously, we transmit the value 0 and insert the object at the front of the list.

I implemented move-to-front queues using a modified form of a Skiplist [Pug90] (the Skiplist structure was modified so that each link recorded the distance it travels forward in the list). By starting the search for an element at the bottom level of the Skiplist, increasing the level to the appropriate level for traversing the Skiplist, and then using a normal Skiplist traversal, I was able to achieve an expected time bound of $O(\log k)$ to do a move-to-front operation on element k of the queue, regardless of the total number of elements in the queue.

This was all that was needed in the decompressor. In the compressor, we also need a way, given an element we may have seen before, to determine if we have seen the element before and if so, where the element is now in the queue. This was implemented by a hashtable from elements to the Skiplist nodes that store them. Once we are at the Skiplist node for an element, we can walk forward to the end of the list (at each node, follow the highest link out of that node, keeping track of the distance traversed by each link). Knowing the distance to the end of the list and the total size of the list allows us to calculate the distance of the element from the front of the list. These operations can all be done in expected time $O(\log n)$, where n is the number of elements in the queue.

A move-to-front generally does an excellent job of producing lots of references with small encodings, which then can often be encoded in a single byte and compresses well with a Huffman encoding. However, a move-to-front encoding pretty much destroys any patterns in the object stream (e.g., an aload_0 instruction is often followed by a getfield instruction). I tried using a move-to-front encoding for JVM opcodes, then using zlib on the result, and got much worse compression than using zlib on the original JVM opcodes. The zlib compression scheme both finds repeating patterns and uses a Huffman-like encoding to efficiently transmit a stream of bytes with a nonuniform distribution pattern. Thus, a move-to-front encoding may do an excellent job when zlib cannot find significant repeating patterns to exploit, but do poorly when they exist.

I compared using zlib on the byte stream generated by a move-to-front encoding with using a Arithmetic encoding on the indices generated by a move-to-front scheme. In the Arithmetic encoding, encoding an index that occurs with probably p requires log2 1/p bits. Given the hypothesis that a move-to-front encoding destroys references patterns and only produces a skewed probably pattern, we would expect the Arithmetic encoding to do better. In the cases I examined, this expectation was fulfilled. For example, for references to virtual methods in rt.jar, using zlib gave results that were 2% bigger than an Arithmetic encoding.

However, these results do not include the size of the dictionaries for the arithmetic encoding, and arithmetic encoding is rather expensive to compress and decompress. The size of the dictionary would be larger than the savings unless it was fitted to a curve and just the parameters for the curve were encoded. Given the negligible or non-existent benefits and the performance cost ditching the built-in zlib decoder for a arithmetic decoder, I decided that this option wasn't worth pursuing.


I considered the following variants, as

Except where noted, seperate pools were used for virtual, interface, static and special method references, and for static and instance field references. The resulting indicies are encoded as a byte stream and compressed as described in Section 6.

Baseline: Simple

Each object is assigned a fixed id. Id's are assigned sequentially, as objects are first seen. All id's are encoded as two bytes. A single pool is used for all method references, and a single pool is used for all field references.

Baseline: Basic

Each object is assigned a fixed id. Id's are assigned sequentially, as objects are first seen, but are encoded compactly.

Competitor: Freq

Like Basic, except that ids were assigned to objects so that the most frequently referenced objects had the smallest id's. Elements that only occur once are all encoded with the same special id.

Competitor: Cache

The Freq scheme was augmented with a LRU cache of 16 elements, implemented as a move-to-front queue. If an object is in cache, it is encoded according to its position in the cache. Separate caches were used for each context.

Variant: move-to-front, transients

In this scheme, objects that are seen exactly once are encoded specially and are not entered into the move-to-front queue.

Variant: move-to-front, use context

For method references, in addition to maintaining different MTF queues for different method kinds (virtual, interface, ...), we also maintain different MTF queues based on top two values of the computed approximate stack state (described in Section 7.1). Thus, we have one MTF queue for virtual methods invoked when there are two integers on top of the stack, and another MTF queue for virtual methods invoked when the top two values on the stack are a reference and an integer.

I do use a common pool for method names across all method types (virtual, static,...), particularly to avoid creating duplicate constant pool entries. However, under this option, we maintain different MTF queues for each method type.

One complication here is that when a method reference is seen for the first time, it must be inserted into all of the MTF queues where it might be seen later.

next up previous
Next: Encoding Integers Up: Compressing Java Class Files Previous: Structuring information
William Pugh