Project #5 (Random Text Generation)

CMSC 132

Due Date: Thursday April 6, 6:00 pm

Object-Oriented Programming II

Type of Homework: Closed

Spring 2006


Objective

The objective of this project to practice using the Java Collections Framework

This homework is considered a closed homework.  Make sure you read the Open/Closed policy before continuing working on this project. 

Overview

For this project you must implement the methods of the DenseBag and MarkovText classes.  These methods will allow us to generate random text. You will need to design and write additional classes in order to help you implement the functionality of those two named classes. In particular, you will need to write two classes that implement the Iterator interface; one to provide an Iterator for DenseBag, and one to implement the functionality of the MarkovText.markovIterator(java.util.Random) method. You may use inner classes if you wish.

Documentation for this project can be found at javadoc documentation

Your project will be graded as follows:

Markov Chains

OK, we know about finite state machines. A first-order Markov chain is a system where states are chosen randomly, with the probability distribution for the next state being determined only by the current state. For example, you could model weather prediction as the following Markov chain:

In a second-order Markov chain, the probability of each state is based on the previous two states; in a third-order Markov chain, the probability of each state is based on the previous three states. More pointers on Markov Chains at:

DenseBag class

As part of this project, you will implement a "Dense Bag".  A bag is like a Set, but with repeats allowed.  A dense bag is a bag where we expect to get lots of repeats.  In this project, bags will contain words, so a typical "dense bag" might contain: 

[170 occurrences of "cow",  200 occurrences of "dog", 1717 occurrences of "cat"]

Note that in the example above you will not actually store 170 separate copies of the word "cow" in your DenseBag object (nor will you even store 170 separate references  to the word "cow") -- rather you will create just one copy, and associate the number 170 with this copy to keep track of the number of occurrences of "cow" that are supposed to be in the bag.

Read the javadoc for the DenseBag class to see what methods you must implement and what they must do.  Note that for the DenseBag class, you need to go beyond the functionality needed for this project, and provide a good, general-purpose implementation of a DenseBag. We will be testing features of your DenseBag implementation that are not needed for the Markov chain text generation.

Markov Chain Text Generation

We are going to read in a corpus of text (also referred to as "training sequence"), and use that to build a Markov chain transition table. For example, using a third-order Markov transition table, we might find that the word sequence "what was the" occurs three times. Twice it is followed by "matter" and once by "cause". In generating new text, if we have a sequence of words ending in "what was the", then we randomly choose which word to add next. One third of the time, we choose "cause", and two thirds of the time we choose "matter".

Lots of people have had this idea, including:

Although lots of people have done this before, we are going to take our own, unique OO approach to the problem.

Creating an Order-n Markov "Transition Table"

Let's suppose we are going to generate random text based on the following input stream (also referred to as "training sequence").

This is a nice sentence.  This is another sentence.  This is another sentence.

From the list of words described above, the program will build a "transition table", which will be a map.  (You will implement the heart of this process.)  For a Markov Chain of Order-n, the map will use Lists of  n words as the keys.  The values that the map assigns to the keys will be DenseBag objects containing words. 

Here's how it works.  Suppose we are generating a Markov Transition Table of Order-2.   Let's consider a typical sequence of two words taken from the input in the example above.   The sequence of two words [This,  is] occurs in three places in the input.  In two places, "This is" is followed by the word "another".  In the other spot, "this is" is followed by "a".  So in the transition table there would be an entry mapping the List [This, is] to a DenseBag containing [2 occurrences of "another", 1 occurrence of "a"].

Almost all entries in the transition table for an order-n Markov Chain will consist of a key with a List of n words mapped to a DenseBag of words.  However, the first few words in the input will generate entries in the transition table where the key is a List of fewer than n words.  As part of your design, you have to figure out how to handle the start and end of training sequences. One possibility is to use the empty string (which never occurs in training sequences) as a special marker.

Below is the Transition Table for the order-2 Markov Chain generated by the text in the example we are considering:

Below is the transition table you would get if you were using order-3

Your program should be able to generate a Markov Transition Table from a list of words using any order specified (1, 2, 3, 17, whatever...)

Generating Random Text using the Transition Table

Once your transition table is complete, you can use it to generate random text.  Let's do an example using the Order-2 Transition Table described above. 

Submission

Submit your project using the submit project option associated with Eclipse. 

Academic Integrity

Please make sure you read the academic integrity section of the syllabus so you understand what is permissible in our programming projects.  We want to remind you that we check your project against other students' projects and any case of academic dishonesty will be referred to the University's Office of Judicial Program