CMSC 132 -- Project 7
Random Text Generator
Project Due: Sunday 5/1/05 at 11:00 PM
This project is A LOT OF WORK -- get going now!
This is a closed project. You are expected to do all of the work on this project without consulting with anyone other than the CMSC 132 instructors and TAs. Treat this project as though it were a take home exam. If you have any questions about whether or not a certain topic is okay to discuss with classmates or friends, please ask your instructor.
The goals of this project are:
Practice Java Collections framework
Lots of Practice Creating Iterators
Learn about Markov Chains
Generate crazy documents!
Your project will scan an existing text document and build a "Markov Transition Table" (explained later) that will be used to generate random text based on patterns from the original. Below is an example of a typical run of the program.
Input: A Scandal in Bohemia (A Sherlock Holmes story.)
Output: Randomly generated story (It almost looks like it was written by a human!)
OK, we know about finite state machines. A first-order Markov chain is a system where states are chosen randomly, with the probably distribution for the next state being determined only by the current state. For example, you could model weather prediction as the following Markov chain:
As part of this project, you will implement a "Dense Bag". A bag is like a Set, but with repeats allowed. A dense bag is a bag where we expect to get lots of repeats. In this project, bags will contain words, so a typical "dense bag" might contain:
[170 occurrences of "cow", 200 occurrences of "dog", 1717 occurrences of "cat"]
It is important to understand that the items in the bag are not ordered, so it doesn't make sense to talk about what is the "first" or "twelfth" thing in the bag. Note that in the example above you will not actually store 170 separate copies of the word "cow" in your DenseBag object (nor will you even store 170 separate references to the word "cow") -- rather you will create just one copy, and associate the number 170 with this copy to keep track of the number of occurrences of "cow" that are supposed to be in the bag.
Read the javadoc for the DenseBag class to see what methods you must implement and what they must do. Note that for the DenseBag class, you need to go beyond the functionality needed for this project, and provide a good, general-purpose implementation of a DenseBag. We will be testing features of your DenseBag implementation that are not needed for the Markov chain text generation.
It is possible to do all sorts of interesting and deep mathematical and scientific things with Markov chains, but that isn't what we are going to do! We're going to generate crazy documents.
Lots of people have had this idea, including:
Although lots of people have done this before, we are going to take our own, unique OO approach to the problem.
Let's suppose we are going to generate random text based on the following input stream:
This is a nice sentence. This is another sentence. This is another sentence.
The first thing the program does is to transfer the contents of the input stream into a list of words, adding the word "<START>" to the beginning and "<END>" to the end. (This is all done for you.) The list we would get in this example would contain the following 15 entries:
[<START>, This, is, a, nice, sentence., This, is, another, sentence., This, is, another, sentence., <END>]
From the list of words described above, the program will build a "transition table", which will be a map. (You will implement the heart of this process.) For a Markov Chain of Order-n, the map will use Lists of (typically) n words as the keys. The values that the map assigns to the keys will be DenseBags containing words.
Here's how it works. Suppose we are generating a Markov Transition Table of Order-2. Let's consider a typical sequence of two words taken from the input in the example above. The sequence of two words [This, is] occurs in three places in the input. In two places, "This is" is followed by the word "another". In the other spot, "this is" is followed by "a". So in the transition table there would be an entry mapping the List [This, is] to a DenseBag containing [2 occurrences of "another", 1 occurrence of "a"].
Almost all entries in the transition table for an order-n Markov Chain will consist of a key with a List of n words mapped to a DenseBag of words. However, the first few words in the input will generate entries in the transition table where the key is a List of fewer than n words. That will allow the random generation process to "get started". (See examples below.)
Below is the Transition Table for the order-2 Markov Chain generated by the text in the example we are considering:
Below is the transition table you would get if you were using order-3. Here we have two unusually short keys generated from the beginning of the input stream.
Your program should be able to generate a Markov Transition Table from a list of words using any order specified (1, 2, 3, 17, whatever...)
Once your transition table is complete, you can use it to generate random text. (You will implement this by creating an Iterator that will generate the random text one word at a time.) Let's do an example using the Order-2 Transition Table described above.
The output from the scenario described above would be:
<START> This is another sentence. <END>
Not very interesting, but a useful illustration!
Higher order Markov chains need to be primed with more text. If you try to build a third order Markov chain with just a thousand lines of text, you will get back significant passages that are exact copies from the original.
While writing this project, feel free to create as many of your own methods and/or classes as you find useful. For example, while writing your DenseBag class, you may find it useful (although not necessary) to write your own "mutable integer" class. You will also probably want to write your own DenseBagIterator class (implementing Iterator), since we are expecting you to write a function the returns an Iterator over a DenseBag. Don't forget that inner classes may be useful for some of these ideas!
Your grade on this project will be calculated as follows: