Kevin Conroy: Understanding the B+ Tree

Understanding the B+ tree

or How to find that page without a fault

By Kevin Conroy
Back to the Main Page
Last updated August 25, 200September 15, 2004Question

"I don't know what a page is, really, but I guess it has something to do with hardware."

Memory pages are a matter of much confusion for 420 students throughout the ages.

On most modern computers, a page is a 4K (or 4096 bytes) chunk of memory. The RAM may or may not reflect this fact physically. In all truth, it's the operating system which imposes the idea of pages on to the abstraction on top the "array" of contiguous memory.

The B+ tree is designed to take advantage of this abstract and trick the operating system into giving you better performance than you can get from other O(log n)-searchable data structures. The detail I go into here is really just to try and help you get a better understanding what page faults are. Understanding them well will certainly help you on a question or two on the final (I would imagine) but many of you can probably get away with an abstract understanding of them.

High Level (and Highly Cliche) Analogy

Memory is like a box of chocolates. (Seriously, this one works.) You've got all of this space to store values (chocolate). The operating system's memory manager is like the little plastic try in the box. It partitions it into chunks that you can work with (pages). In each slot (page) you can store values (chocolates) which you can get at one at a time. When you go for a chocolate and there's not one there (or it's not a flavor that you like), it's a page fault.

Why bother with the B+ Tree?

So you've gone through the trouble of learning about the new data structure called a B+ tree. It's got their funky internal nodes with "key" rather than "values/objects." But why go through all of this trouble when we could have a perfectly fine skip list?

Skip lists, like many fine data structures and algorithms that you've learned about, are platform and architecture independent in theory. They are optimized for speed, have low storage overhead (for the most part), and can be ported to any system your little code monkey heart can get it to compile on. This approach is good because it doesn't concern itself with the details of how the computer works and separates the software design from the hardware design.

This is the essence of object-oriented programming. "I've got this machine which will follow these commands and will do all of the work with reading/writing to memory and registers so I'll just worry about getting my logic right." You don't care how it saves values to memory locations, or how pass by value/reference works. You just code. There is a hard line between hardware and software. You make software which worries about things at a software level and you make hardware which worries about things are a hardware level. (We'll just ignore things like software drivers for hardware.)

But sometimes you may want to blur that line. Sometimes, if you understand how the machine performs a particular task, you can use your knowledge of that to write software which plays off of this knowledge to write more efficient code.

Imagine if you opened up the source to the Java API and studied it for a month. Afterwards, you'd know all sorts of tricks about how to cut corners to save memory and CPU time. You'd probably be able to win some top coder competitions. ;-)

The B+ tree does a similar thing, only rather than the Java API, we're concerning ourselves with how the CPU (i.e. your Pentium chip) gets things from memory (RAM). Let's take a look under the hood to see what's going on.

A Crash Course on Computer Architecture and Operating Systems

NOTE: What I'm about to tell you takes up 2 chapters in the CMSC412 (Operating Systems) book and 1 chapter in the CMSC411 (Computer Architecture) book, so if you want more detail, take one of those classes and you will get it. I'm glossing over and skipping some very large pieces that make the computer work that you don't need to worry about for the B+ tree.

NOTE for anyone who knows CMSC411 (Computer Architecture): We're going to just assume access times to L1, L2, and L3 cache are about the same for the following examples. We're really concerned about main memory and hard disk accesses.

The Slavery of Being a C[PU|S Major]

Imagine that you're (literally) chained to your desk all night doing homework. That's the CPU (esp. when you're doing 420!). Since you're at your desk, you can only get to a few things like the piles of papers or the books sitting on your desk or nearby on the floor. All of the stuff in McKeldin library is far out of reach.

This is the situaSeptember 15, 2004che of data memory. You can think of this as two piles of papers sitting on your desk. One contains instructions which are assembly language instructions (MIPS, DLX, etc). The other contains data values which are really just a copy of a tiny fraction of main memory. Data cache is like a cheat sheet that you make for an exam. It (hopefully) has lots of pertinent information for your test but it's only one page and try as you might, there 's no way that you can or want to put the entire book onto that single page of paper.

So the CPU sits there, going through the instructions in its instruction cache. You sit at your desk doing things in your "to do" pile. The CPU writes and reads values from the data cache... you look up things and write down notes on your cheat sheet.

That's all fine and dandy, but what happens when the CPU runs out of instructions? That'd be the same thing as you suddenly realizing that the 420 spec you printed out from OIT is missing a page and you're going to have to walk all the way over to AVW to pick it up. What is the CPU has to look up a data value that isn't in the data cache? That's just like you've been doing a research project and realize you have to go get a book out of McKeldin library to finish your paper.

This, my fellow coders, is a page fault.

It is a nasty, ugly, time consuming process.

The CPU requests a memory location from instruction or data cache which is not present, and then the memory-management hardware does it's thing (in conjunction with the operating system) and goes to main memory (RAM) or even worse, to the hard disk, and retrieves the value requested.

Main memory, or RAM, is like McKeldin library. It's big, has lots of stuff that you want, and as long as you're a UMD student, it's pretty easy to get information out of it - just swing by there between classes. The hard disk is like the National Archives. It's much much bigger than main memory (McKeldin) but if you want to get something out of the archives, you have to waste like an hour on the metro or driving down there, going in, finding all of your information, and then coming back to campus. It's a royal pain in the butt.

Why a page?

But what's all this "page" stuff? Why not just call it a memory fault or something?

Well, given the time and cost it takes to go to memory or worse the hard disk, the CPU is going to grab a whole page of memory rather than just a single byte/word. The size of a page is architecture dependent. For instance, the x86 Intel architecture has 4K (that's 4096 bytes) pages. You'd do the same thing. You wouldn't just go to OIT and take one sentence or one paragraph and leave the rest of the sheet there and come back later and get the next sentence. No, you'd take the whole freaking page and worry about which parts you needed to read later. Same think with McKeldin or the Archives - you'll check out the entire book or photocopy the pages you want rather than just copying a single sentence from a book.

So a page fault is this expensive, time consuming operation which requires going to main memory or the hard disk and copying a large hunk of values into the cache so that the CPU can get to it quickly.

Exploiting Pages

Okay, hopefully you're clear on what a page fault is and why they are bad. Now, let's use this information to make our programs run faster.

Let's say we're looking through the values in a skip list trying to figure out where to insert a value. The first node is too small, go to the next one. Oh no! The next node we're going to is so far away from the first one in memory that we've incurred a page fault. Now the CPU has to go to main memory or the hard disk to get the node you want to look at and bring it into the data cache. (You're trying to finish a stupid paper and have to go to McKeldin at 2am to get a book your forgot.) It's bad. It's annoying. It's really slowing you down because you're trying to traverse this array quickly to figure out where to go. So you finally get the node you want into data
cache, realize it's too small too, and try to go to the next node. And you page fault again. You could keep doing this the whole time as you go through the skip list. Imagine that you just kept realizing you forgot "just one page" from OIT or "just one book from McKeldin" and just spend the vast majority of your evening walking to and from your dorm room to retrieve the necessary information to do your homework but never really getting any homework done. This is the situation the CPU is in.

(FYI: the previous paragraph assumed an empty cache, the worst case for a skip list)

Please, someone, help!

Enter the B+ tree. This structure is designed to avoid page faults. If you pick an appropriate order for the computer you're using (translation: pick an order such that the size of an internal node is as close to but still less than or equal to the size of a page) then you can avoid lots of unnecessary trips. Imagine that you thought ahead and checked out 10 books from McKeldin rather than just one. Well, as soon as you realize that the first one doesn't have the stuff you need, you've got another 9 just sitting
there. The CPU will be able to look at all of the values in an internal node very quickly if the internal node can fit into a page of memory and can thus fit into the data cache. (This last sentence is entirely true, but until you take 411 you won't know why, but it's not important for our purposes.)

So sure, you may page fault as you move from node to node in the B+ tree, but as long as your looking inside of a single node AND if you have picked an order that works AND if you design your B+ tree properly, then you shouldn't have any page faults while moving about inside of a single node.

Page faults don't happen every time you try to reference something outside of your current page, but they CAN happen. It's a worst case situation for it to happen every time, but much like many things in CS, we're always trying to avoid that worst cast or make it faster. The B+ tree does what it can to minimize the number of page faults that occur.

Internal Nodes With Know-how

So this brings us to a design question: how can I get an internal node to fit in a single page of memory? The key is to keep it simple. Stick to as many primitive types as you can. What you need to do is basically make your node class behave like a C-style struct. All of the data members' locations in memory have to be INSIDE of the class. You can't have a pointer off to some other location in memory. That's like having a post-it note that says "See the book in McKeldin" rather than checking it out. The only thing which Java stores INSIDE of the class are primitives and pointers to any member objects, but NOT the actual objects themselves. Again, just a post-in note telling the CPU to go page fault to main memory to get the values it wants.

Large objects, like skip lists, will not reside inside of the internal node. They will just be a pointer and you're going to start page faulting like crazy as soon as you try and traverse it.

To really be sure that your design fits inside of a page, you'd have to have pretty good knowledge about how the Java VM and API work to know how it creates objects and how those objects are laid out in memory. which goes against the OOP fundamentals that Java was built on. From my knowledge of Java, I don't think that you can *truly* ever make a *real* B+ tree. I think you really need to use C or C++ to enforce the memory layout requirements. But that's a moot point really since you're doing it in Java.

Just remember, keep your design as simple as possible. Don't use a skip list to store your keys. Bad cookie.

Appendix: Java Memory Layout

Due to security reasons, there is no specification about how memory layout is done in Java VM memory: http://www.artima.com/underthehood/overviewsecurity3.html

However, any object's memory signature looks like this (for Jikes compiled code): http://www-124.ibm.com/developerworks/oss/jikesrvm/userguide/HTML/userguide_19.html

In a nutshell, the object memory-footprint consists of private data members, a pointer to a table of function pointers, and some other book-keeping/security/synchronization information. No actual functions are stored with each object - rather, each object points to a table in some other location in memory which contains pointers to the code for each method (two layers of indirection if I understand it correctly).

So, when you call a function, the function along with all of it's function-scoped (temporary) variables are added to the stack. The only variables that are kept as a part of the object are the data members (public/protected/private).

So, when you make your classes and are concerned about page faults, be concerned about the size and number of data members in your class.

NOTE: All of this memory layout is compiler dependent, so changing compilers may change the number of page faults you have.

NOTE: These layout rules change with C++ and in C they are basically gone since you have nearly complete control (esp. at the OS level).

<MORE ADVANCED THAN YOU NEED TO KNOW>

Page faults, by definition, can happen anytime, anywhere. With an object, with the stack, with the code for a function... anything that's in memory that has been marked as "pagable" by the operating system is fair game for being swapped out of memory and on to disk. Thankfully, there are some good page replacement algorithms out there that do their very best to ensure that the page you will be least likely to use next will get swapped out rather than the one you want next. Fancy algorithms may even try to predict what page you will want next and may try to bring it in before you request it, thus reducing (or eliminating) the penalty of a page fault. Such algorithms, however, are often wrong if you are not doing sequential memory access, thus many systems simple do demand paging, which is when the system merely brings each page when it faults and does not try to do any prediction. This is what you will see most often in modern systems. Each of these algorithms, however, may (may, not will) fail to prevent page faults against an intelligent and malicious program that knows the number of available system resources and the algorithm being used. This is what is known as thrashing and can occur even in systems with non-malicious programs. The solution to thrashing (I kid you not): buy more RAM. Given cheap memory costs, it's actually easier to buy another stick of memory than actually modifying your algorithm. (After the OS has shipped, that is. If you're still debugging your OS, then you'd better try to fix that algorithm!)

</MORE ADVANCED THAN YOU NEED TO KNOW>

By Kevin Conroy
Back to the Main Page
Last updated August 25, 2004