CMSC 132 -- Project 7

Random Text Generator

Project Due:  Sunday 5/1/05 at 11:00 PM

This project is A LOT OF WORK -- get going now!

## Closed Policy

This is a closed project.  You are expected to do all of the work on this project without consulting with anyone other than the CMSC 132 instructors and TAs.  Treat this project as though it were a take home exam.  If you have any questions about whether or not a certain topic is okay to discuss with classmates or friends, please ask your instructor.

## Goals

The goals of this project are:

• Practice Java Collections framework

• Lots of Practice Creating Iterators

• Generate crazy documents!

## Overview

Your project will scan an existing text document and build a "Markov Transition Table" (explained later) that will be used to generate random text based on patterns from the original.  Below is an example of a typical run of the program.

Input:   A Scandal in Bohemia  (A Sherlock Holmes story.)

Output:  Randomly generated story  (It almost looks like it was written by a human!)

## Markov Chains

OK, we know about finite state machines.  A first-order Markov chain is a system where states are chosen randomly, with the probability distribution for the next state determined only by the current state.  For example, you could model weather prediction as the following Markov chain:

• If it is sunny today, there is a 90% chance it will be sunny tomorrow, otherwise it will be cloudy.
• If it is cloudy today, there is a 50% chance it will be sunny tomorrow, otherwise it will be cloudy.
In a second-order Markov chain, the probability of each state is based on the previous two states; in a third-order Markov chain, the probability of each state is based on the previous three states.
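The weather example above can be sketched as a few lines of Java.  This is just an illustration of the "next state depends only on the current state" idea, not part of the project code; the class and method names here are made up.

```java
import java.util.Random;

// A minimal sketch of the first-order weather chain described above.
// The transition probabilities are exactly those in the bullet list.
class Weather {
    enum State { SUNNY, CLOUDY }

    // Given today's state, pick tomorrow's state.  Note that the choice
    // depends only on today's state -- the defining property of a
    // first-order Markov chain.
    static State next(State today, Random rng) {
        double p = rng.nextDouble();
        if (today == State.SUNNY) {
            return p < 0.90 ? State.SUNNY : State.CLOUDY;
        } else {
            return p < 0.50 ? State.SUNNY : State.CLOUDY;
        }
    }

    public static void main(String[] args) {
        Random rng = new Random();
        State s = State.SUNNY;
        for (int day = 0; day < 7; day++) {
            System.out.println("Day " + day + ": " + s);
            s = next(s, rng);
        }
    }
}
```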

## DenseBag class

As part of this project, you will implement a "Dense Bag".  A bag is like a Set, but with repeats allowed.  A dense bag is a bag where we expect to get lots of repeats.  In this project, bags will contain words, so a typical "dense bag" might contain:

[170 occurrences of "cow",  200 occurrences of "dog", 1717 occurrences of "cat"]

It is important to understand that the items in the bag are not ordered, so it doesn't make sense to talk about what is the "first" or "twelfth" thing in the bag.  Note that in the example above you will not actually store 170 separate copies of the word "cow" in your DenseBag object (nor will you even store 170 separate references  to the word "cow") -- rather you will create just one copy, and associate the number 170 with this copy to keep track of the number of occurrences of "cow" that are supposed to be in the bag.

Read the javadoc for the DenseBag class to see what methods you must implement and what they must do.  Note that for the DenseBag class, you need to go beyond the functionality needed for this project, and provide a good, general-purpose implementation of a DenseBag. We will be testing features of your DenseBag implementation that are not needed for the Markov chain text generation.
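To make the "one copy plus a count" idea concrete, here is a rough sketch of how a count-based bag might store its data.  The class and method names below are illustrative only; the methods your DenseBag must actually provide are specified in the javadoc, not here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative sketch of the counting idea behind a dense bag: store one
// copy of each distinct element, paired with the number of occurrences.
class CountBag<T> {
    private final Map<T, Integer> counts = new HashMap<>();
    private int size = 0;   // total number of elements, counting repeats

    void add(T item) {
        counts.merge(item, 1, Integer::sum);  // bump the count by one
        size++;
    }

    int getCount(T item) {
        return counts.getOrDefault(item, 0);
    }

    int size() {
        return size;
    }

    // Pick an element uniformly at random over all occurrences, so a word
    // counted 200 times is twice as likely as one counted 100 times.
    T choose(Random rng) {
        int target = rng.nextInt(size);
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            target -= e.getValue();
            if (target < 0) return e.getKey();
        }
        throw new IllegalStateException("bag is empty");
    }
}
```

Notice that adding "cow" 170 times stores the string once and the integer 170 once, which is exactly why the representation stays small even when the bag is "dense" with repeats.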

## Markov Chain Text Generation

It is possible to do all sorts of interesting and deep mathematical and scientific things with Markov chains, but that isn't what we are going to do!  We're going to generate crazy documents.

Lots of people have had this idea before, but we are going to take our own, unique OO approach to the problem.

### First Step

Let's suppose we are going to generate random text based on the following input stream:

This is a nice sentence.  This is another sentence.  This is another sentence.

The first thing the program does is to transfer the contents of the input stream into a list of words, adding the word "<START>" to the beginning and "<END>" to the end.  (This is all done for you.)  The list we would get in this example would contain the following 15 entries:

[<START>, This, is, a, nice, sentence., This, is, another, sentence., This, is, another, sentence., <END>]

### Creating an Order-n Markov "Transition Table"

From the list of words described above, the program will build a "transition table", which will be a map.  (You will implement the heart of this process.)  For a Markov Chain of Order-n, the map will use Lists of (typically) n words as the keys.  The values that the map assigns to the keys will be DenseBags containing words.

Here's how it works.  Suppose we are generating a Markov Transition Table of Order-2.   Let's consider a typical sequence of two words taken from the input in the example above.   The sequence of two words [This,  is] occurs in three places in the input.  In two places, "This is" is followed by the word "another".  In the other spot, "This is" is followed by "a".  So in the transition table there would be an entry mapping the List [This, is] to a DenseBag containing [2 occurrences of "another", 1 occurrence of "a"].

Almost all entries in the transition table for an order-n Markov Chain will consist of a key with a List of n words mapped to a DenseBag of words.  However, the first few words in the input will generate entries in the transition table where the key is a List of fewer than n words.  That will allow the random generation process to "get started".  (See examples below.)

Below is the Transition Table for the order-2 Markov Chain generated by the text in the example we are considering:

• [] --> [1 occurrence of "<START>"]                                          // Unusual key for order-2.  It's an empty List!
• [<START>]  --> [1 occurrence of "This"]                                   // Unusual key for order-2.  It's a List with only one entry!
• [<START>, This] --> [1 occurrence of "is"]
• [This, is] --> [1 occurrence of "a", 2 occurrences of "another"]
• [is, a] --> [1 occurrence of "nice"]
• [a, nice] --> [1 occurrence of "sentence."]
• [nice, sentence.] --> [1 occurrence of "This"]
• [sentence., This] --> [2 occurrences of "is"]
• [is, another] --> [2 occurrences of "sentence."]
• [another, sentence.] --> [1 occurrence of "This", 1 occurrence of "<END>"]

Below is the transition table you would get if you were using order-3.  Here we have two unusually short keys generated from the beginning of the input stream.

• [] --> [1 occurrence of "<START>"]                                    // Unusual key for order-3.  It's an empty List!
• [<START>] --> [1 occurrence of "This"]                              // Unusual key for order-3.  It's a List with only one entry!
• [<START>, This] --> [1 occurrence of "is"]                          // Unusual key for order-3.  It's a List with only two entries!
• [<START>, This, is] --> [1 occurrence of "a"]
• [This, is, a] --> [1 occurrence of "nice"]
• [is, a, nice] --> [1 occurrence of "sentence."]
• [a, nice, sentence.] --> [1 occurrence of "This"]
• [nice, sentence., This] --> [1 occurrence of "is"]
• [sentence., This, is] --> [2 occurrences of "another"]
• [This, is, another] --> [2 occurrences of "sentence."]
• [is, another, sentence.] --> [1 occurrence of "This", 1 occurrence of "<END>"]

Your program should be able to generate a Markov Transition Table from a list of words using any order specified (1, 2, 3, 17, whatever...).
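The table-building process described above can be sketched in a short loop.  This is only one way to do it, and it uses a plain Map<String, Integer> to stand in for the DenseBag values (in the real project the values would be DenseBags of words); the class and method names are made up for illustration.

```java
import java.util.*;

// Sketch of building an order-n transition table from a word list.
class TableBuilder {
    static Map<List<String>, Map<String, Integer>> build(List<String> words, int n) {
        Map<List<String>, Map<String, Integer>> table = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            // The key is the (up to) n words immediately before position i.
            // Near the start of the list the key is shorter than n, which is
            // exactly what produces the "unusual" short keys shown above.
            List<String> key = new ArrayList<>(words.subList(Math.max(0, i - n), i));
            table.computeIfAbsent(key, k -> new HashMap<>())
                 .merge(words.get(i), 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("<START>", "This", "is", "a", "nice",
                "sentence.", "This", "is", "another", "sentence.", "This", "is",
                "another", "sentence.", "<END>");
        // For order 2, this should reproduce the order-2 table above,
        // e.g. [This, is] maps to counts {a=1, another=2}.
        System.out.println(build(words, 2).get(Arrays.asList("This", "is")));
    }
}
```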

### Generating Random Text using the Transition Table

Once your transition table is complete, you can use it to generate random text.  (You will implement this by creating an Iterator that will generate the random text one word at a time.)  Let's do an example using the Order-2 Transition Table described above.

• To start things off, look up the one-word List [<START>] in the map.  We get back a DenseBag containing [1 occurrence of "This"].  No choices here -- we will select the word "This" and now consider the two-word list [<START>, This].
• Okay, look up the List [<START>, This] in the map.  We get back the DenseBag containing [1 occurrence of "is"].  Okay, still no choices.  We take the word "is" and now consider the two-word list [This, is].
• Okay, look up the List [This, is] in the map.  We get back the DenseBag containing [1 occurrence of "a", 2 occurrences of "another"].  This is where the randomization comes in!  Since there are 3 total things in the bag, 1/3 of the time we will select the word "a" and 2/3 of the time we will select "another".  Let's imagine that in this example we happened to choose "another".  Now we'll consider [is, another].
• Looking up [is, another] yields a dense bag containing [2 occurrences of "sentence."].  No choice here; we'll choose "sentence." and consider the List [another, sentence.].
• Looking up [another, sentence.] yields a dense bag containing [1 occurrence of "This", 1 occurrence of "<END>"].  Randomization will be used again!  Since there are a total of 2 things in the bag, 1/2 of the time we will select "This" and 1/2 of the time we will select "<END>".  Let's imagine that in this example we happened to choose "<END>".  The process stops.

The output from the scenario described above would be:

<START> This is another sentence. <END>

Not very interesting, but a useful illustration!
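The generation steps above can be sketched as a plain loop.  Remember that in the project you are expected to implement this as an Iterator that produces one word per call to next(); the loop below is only meant to show the lookup-choose-slide pattern, again with Map<String, Integer> standing in for the DenseBag, and all names invented for illustration.

```java
import java.util.*;

// Sketch of the generation loop described in the walkthrough above.
class Generator {
    static Map<List<String>, Map<String, Integer>> build(List<String> words, int n) {
        Map<List<String>, Map<String, Integer>> table = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            List<String> key = new ArrayList<>(words.subList(Math.max(0, i - n), i));
            table.computeIfAbsent(key, k -> new HashMap<>())
                 .merge(words.get(i), 1, Integer::sum);
        }
        return table;
    }

    // Weighted random choice: a word counted twice is twice as likely.
    static String choose(Map<String, Integer> bag, Random rng) {
        int total = 0;
        for (int c : bag.values()) total += c;
        int target = rng.nextInt(total);
        for (Map.Entry<String, Integer> e : bag.entrySet()) {
            target -= e.getValue();
            if (target < 0) return e.getKey();
        }
        throw new IllegalStateException("bag is empty");
    }

    static List<String> generate(Map<List<String>, Map<String, Integer>> table,
                                 int n, Random rng) {
        List<String> out = new ArrayList<>();
        out.add("<START>");
        // Keep looking up the last (up to) n words and choosing a successor
        // at random until we happen to choose "<END>".
        while (!out.get(out.size() - 1).equals("<END>")) {
            List<String> key = out.subList(Math.max(0, out.size() - n), out.size());
            out.add(choose(table.get(key), rng));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("<START>", "This", "is", "a", "nice",
                "sentence.", "This", "is", "another", "sentence.", "This", "is",
                "another", "sentence.", "<END>");
        Map<List<String>, Map<String, Integer>> table = build(words, 2);
        System.out.println(String.join(" ", generate(table, 2, new Random())));
    }
}
```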

### Markov chain order

Higher order Markov chains need to be primed with more text. If you try to build a third order Markov chain with just a thousand lines of text, you will get back significant passages that are exact copies from the original.

## Your Own Methods and Classes

While writing this project, feel free to create as many of your own methods and/or classes as you find useful.  For example, while writing your DenseBag class, you may find it useful (although not necessary) to write your own "mutable integer" class.  You will also probably want to write your own DenseBagIterator class (implementing Iterator), since we are expecting you to write a function that returns an Iterator over a DenseBag.  Don't forget that inner classes may be useful for some of these ideas!
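As one illustration of the inner-class idea, here is a rough sketch of an iterator over a count-based bag.  All names here are invented (your DenseBag must match its javadoc); the point is that an inner class can read the enclosing object's fields directly, and that the iterator can return each element as many times as its count without the bag ever storing duplicate references.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Sketch of an Iterator implemented as an inner class of a count-based bag.
class SmallBag<T> implements Iterable<T> {
    private final Map<T, Integer> counts = new HashMap<>();

    void add(T item) {
        counts.merge(item, 1, Integer::sum);
    }

    // Inner class: it can use the enclosing bag's counts map directly.
    private class BagIterator implements Iterator<T> {
        private final Iterator<Map.Entry<T, Integer>> entries =
                counts.entrySet().iterator();
        private T current;       // element currently being repeated
        private int remaining;   // how many more times to return it

        public boolean hasNext() {
            return remaining > 0 || entries.hasNext();
        }

        public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            if (remaining == 0) {           // move on to the next element
                Map.Entry<T, Integer> e = entries.next();
                current = e.getKey();
                remaining = e.getValue();
            }
            remaining--;
            return current;
        }
    }

    public Iterator<T> iterator() {
        return new BagIterator();
    }
}
```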