Project #6 (Random Text Generation)

CMSC 132

Due Date: Friday April 13, 6:00 pm

Object-Oriented Programming II

Type of Homework: Closed

Spring 2007


Objective

The objective of this project is to practice using the Java Collections Framework.

This homework is considered a closed homework.  Make sure you read the open/closed policy before continuing to work on this project.

Project Clarifications

Any clarifications or corrections associated with this project will be available at: clarifications.html

Overview

For this project you must implement the methods of the DenseBag and MarkovText classes.  These methods will allow us to generate random text. You will need to design and write additional classes in order to help you implement the functionality of those two named classes. In particular, you will need to write two classes that implement the Iterator interface; one to provide an Iterator for DenseBag, and one to implement the functionality of the MarkovText.markovIterator(java.util.Random) method. You may use inner classes if you wish.

Documentation for this project can be found in the javadoc documentation.

Your project will be graded as follows:

Markov Chains

OK, we know about finite state machines. A first-order Markov chain is a system where states are chosen randomly, with the probability distribution for the next state determined only by the current state. For example, you could model weather prediction as a Markov chain in which tomorrow's weather depends only on today's weather.

In a second-order Markov chain, the probability of each state is based on the previous two states; in a third-order Markov chain, the probability of each state is based on the previous three states.
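To make the first-order idea concrete, here is a minimal weather-chain sketch in Java. The transition probabilities are made up purely for illustration (say, a sunny day is followed by another sunny day 7 times out of 10); the class and variable names are ours, not part of the project.

    import java.util.*;

    // A tiny first-order Markov chain for weather.  The transition
    // probabilities below are invented purely for illustration.
    public class WeatherChain {
        public static void main(String[] args) {
            Map<String, List<String>> next = new HashMap<String, List<String>>();
            // Each list encodes a probability distribution by repetition:
            // after "sunny": 7 chances in 10 of "sunny", 3 in 10 of "rainy".
            next.put("sunny", Arrays.asList("sunny", "sunny", "sunny", "sunny",
                                            "sunny", "sunny", "sunny",
                                            "rainy", "rainy", "rainy"));
            // After "rainy": 4 chances in 10 of "rainy", 6 in 10 of "sunny".
            next.put("rainy", Arrays.asList("rainy", "rainy", "rainy", "rainy",
                                            "sunny", "sunny", "sunny",
                                            "sunny", "sunny", "sunny"));
            Random rand = new Random();
            String state = "sunny";
            for (int day = 1; day <= 7; day++) {
                System.out.println("Day " + day + ": " + state);
                List<String> choices = next.get(state);
                // The next state depends only on the current state.
                state = choices.get(rand.nextInt(choices.size()));
            }
        }
    }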

DenseBag class

As part of this project, you will implement a "Dense Bag".  A bag is like a Set, but with repeats allowed.  A dense bag is a bag where we expect to get lots of repeats.  In this project, bags will contain words, so a typical "dense bag" might contain: 

[170 occurrences of "cow",  200 occurrences of "dog", 1717 occurrences of "cat"]

Note that in the example above you will not actually store 170 separate copies of the word "cow" in your DenseBag object (nor will you even store 170 separate references  to the word "cow") -- rather you will create just one copy, and associate the number 170 with this copy to keep track of the number of occurrences of "cow" that are supposed to be in the bag.
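One natural way to do this (a sketch only; the project javadoc, not this sketch, defines what you must actually implement) is to keep a map from each distinct element to its count:

    import java.util.*;

    // Sketch of the counting idea behind DenseBag.  Method names here are
    // illustrative; see the project javadoc for the required interface.
    public class DenseBag<T> {
        private Map<T, Integer> counts = new HashMap<T, Integer>();
        private int size = 0;   // total number of occurrences in the bag

        public void add(T item) {
            Integer c = counts.get(item);
            counts.put(item, (c == null) ? 1 : c + 1);  // one map entry per distinct item
            size++;
        }

        public int getCount(T item) {
            Integer c = counts.get(item);
            return (c == null) ? 0 : c;
        }

        public int size() {
            return size;
        }

        // Pick a random element, with probability proportional to its count.
        public T chooseRandom(Random rand) {
            int target = rand.nextInt(size);   // index into the "expanded" bag
            for (Map.Entry<T, Integer> e : counts.entrySet()) {
                target -= e.getValue();
                if (target < 0) {
                    return e.getKey();
                }
            }
            throw new NoSuchElementException();  // unreachable if size is accurate
        }
    }

With a representation like this, the example bag above takes only three map entries instead of 2087 stored references.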

Read the javadoc for the DenseBag class to see what methods you must implement and what they must do.  Note that for the DenseBag class, you need to go beyond the functionality needed for this project, and provide a good, general-purpose implementation of a DenseBag. We will be testing features of your DenseBag implementation that are not needed for the Markov chain text generation.
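For the iterator, one possibility (again only a sketch, assuming the count-map representation above) is an inner class that walks the map and returns each element as many times as its count; check the javadoc for whether remove must actually be supported:

    // Inside DenseBag, assuming the counts map sketched above:
    public Iterator<T> iterator() {
        return new DenseBagIterator();
    }

    private class DenseBagIterator implements Iterator<T> {
        private Iterator<Map.Entry<T, Integer>> entries = counts.entrySet().iterator();
        private T current;           // distinct element currently being repeated
        private int remaining = 0;   // how many more times to return it

        public boolean hasNext() {
            return remaining > 0 || entries.hasNext();
        }

        public T next() {
            if (remaining == 0) {
                // Move to the next distinct element; entries.next() throws
                // NoSuchElementException when the bag is exhausted.
                Map.Entry<T, Integer> e = entries.next();
                current = e.getKey();
                remaining = e.getValue();
            }
            remaining--;
            return current;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }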

Markov Chain Text Generation

We are going to read in a corpus of text (also referred to as "training sequence"), and use that to build a Markov chain transition table. For example, using a third-order Markov transition table, we might find that the word sequence "what was the" occurs three times. Twice it is followed by "matter" and once by "cause". In generating new text, if we have a sequence of words ending in "what was the", then we randomly choose which word to add next. One third of the time, we choose "cause", and two thirds of the time we choose "matter".

Lots of people have had this idea before, but we are going to take our own, unique OO approach to the problem.

Creating an Order-n Markov "Transition Table"

Let's suppose we are going to generate random text based on the following input stream (also referred to as "training sequence").

This is a nice sentence.  This is another sentence.  This is another sentence.

From the list of words described above, the program will build a "transition table", which will be a map. For a Markov chain of Order-n, the map will use Lists of n words as the keys.  The values that the map assigns to the keys will be DenseBag objects containing words.
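In Java terms, such a table might be declared like this (a sketch; the exact types are part of your design):

    // Keys: Lists of n consecutive words.  Values: bags of the words that
    // were seen following each key in the training sequence.
    Map<List<String>, DenseBag<String>> table =
        new HashMap<List<String>, DenseBag<String>>();

A HashMap works here because java.util.List specifies equals and hashCode element by element, so two lists holding the same words in the same order act as the same key.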

Here's how it works.  Suppose we are generating a Markov Transition Table of Order-2.  Let's consider a typical sequence of two words taken from the input in the example above.  The sequence of two words [This, is] occurs in three places in the input.  In two places, "This is" is followed by the word "another".  In the other spot, "This is" is followed by "a".  So in the transition table there would be an entry mapping the List [This, is] to a DenseBag containing [2 occurrences of "another", 1 occurrence of "a"].

As part of your design, you have to figure out how to define the first and last map entries associated with a training sequence. One possibility is to use the empty string (which never occurs in training sequences) as a special marker. For example, for an Order-1 Markov Transition Table trained on the sequence [ "a", "a", "b" ], we might generate the map:

    [""]  = {"a" = 1}

    ["a"] = {"a" = 1, "b" = 1}

    ["b"] = {"" = 1}
For an Order-2 Markov Transition Table trained on the sequence [ "a", "a", "a", "b" ], we would generate the map:
    ["", ""]   = {"a" = 1}

    ["", "a"]  = {"a" = 1}

    ["a", "a"] = {"a" = 1, "b" = 1}

    ["a", "b"] = {"" = 1}

For an Order-n Markov Transition Table, the first map entry will map a list of n empty strings to a DenseBag containing the first word in the training sequence.  The map entry representing the end of the training sequence will map to a DenseBag that contains an empty string (regardless of the order used).  Keep in mind that other strings besides the empty string can appear in that last bag (see the Order-2 Markov Chain example below).

Example

Below is the Transition Table for the Order-2 Markov Chain generated by the following text:

This is a nice sentence.  This is another sentence.  This is another sentence.

["",""]               = {"This" = 1}                             

["", This]            = {"is" = 1}

[This, is]            = {"a" = 1, "another" = 2}

[is, a]               = {"nice" = 1}

[a, nice]             = {"sentence." = 1}

[nice, sentence.]     = {"This" = 1}

[sentence., This]     = {"is" = 2}

[is, another]         = {"sentence." = 2}

[another, sentence.]  = {"This" = 1, "" = 1}

Below is the transition table you would get if you were using Order-3:

["","",""]                 = {"This" = 1}

["","", This]              = {"is" = 1}               

["", This, is]             = {"a" = 1}

[This, is, a]              = {"nice" = 1}

[is, a, nice]              = {"sentence." = 1}

[a, nice, sentence.]       = {"This" = 1}

[nice, sentence., This]    = {"is" = 1}

[sentence., This, is]      = {"another" = 2}

[This, is, another]        = {"sentence." = 2}

[is, another, sentence.]   = {"This" = 1, "" = 1}

[another, sentence., This] = {"is" = 1}

Your program should be able to generate a Markov Transition Table from a list of words using any specified order (1, 2, 3, 17, whatever...).
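One way to build such a table for any order n (a sketch with illustrative names, using the DenseBag sketch from earlier and the empty-string padding described above) is:

    // Build an Order-n transition table from the training words.
    // Sketch only; buildTable and its types are illustrative names.
    public static Map<List<String>, DenseBag<String>> buildTable(
            List<String> words, int n) {
        Map<List<String>, DenseBag<String>> table =
            new HashMap<List<String>, DenseBag<String>>();

        // Pad with n empty strings at the front (start marker) and one
        // at the end (end-of-text marker), as described above.
        List<String> padded = new ArrayList<String>();
        for (int i = 0; i < n; i++) padded.add("");
        padded.addAll(words);
        padded.add("");

        for (int i = 0; i + n < padded.size(); i++) {
            // The key is the window of n words; copy it so the key is
            // independent of the padded list.
            List<String> key = new ArrayList<String>(padded.subList(i, i + n));
            DenseBag<String> bag = table.get(key);
            if (bag == null) {
                bag = new DenseBag<String>();
                table.put(key, bag);
            }
            bag.add(padded.get(i + n));   // the word that follows the window
        }
        return table;
    }

Run on the Order-1 example earlier, buildTable(Arrays.asList("a", "a", "b"), 1) produces exactly the three entries shown there.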

Generating Random Text using the Transition Table

Once your transition table is complete, you can use it to generate random text.  Let's do an example using the Order-2 Transition Table described above.  Start with the key ["", ""]; its bag contains only "This", so "This" is the first word generated.  Now slide the window: ["", This] yields "is", and [This, is] yields "a" one third of the time and "another" two thirds of the time.  Continue choosing each new word from the bag for the last two words generated, until the empty string is drawn, which marks the end of the text.
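The project asks you to expose this process through MarkovText.markovIterator(java.util.Random); the loop below shows the same idea as a plain method (a sketch only, assuming the buildTable and chooseRandom sketches above):

    // Generate words until the end marker (the empty string) is drawn.
    // Sketch only; chooseRandom is the illustrative DenseBag method
    // sketched earlier, which picks an element with probability
    // proportional to its count.
    public static List<String> generate(Map<List<String>, DenseBag<String>> table,
                                        int n, Random rand) {
        List<String> result = new ArrayList<String>();

        // Start from the all-empty-string key that marks the beginning.
        List<String> window = new ArrayList<String>();
        for (int i = 0; i < n; i++) window.add("");

        while (true) {
            String next = table.get(window).chooseRandom(rand);
            if (next.equals("")) break;     // end-of-text marker reached
            result.add(next);
            window.remove(0);               // slide the window forward one word
            window.add(next);
        }
        return result;
    }

Every window this loop produces occurred somewhere in the training sequence, so table.get(window) never returns null.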

Markov Text Applet

You can download and add to your project a Java applet to help test your Markov text generator. The applet provides a GUI for entering/editing training data (or downloading training data directly from a URL), then displays the generated text. The GUI also allows you to specify the order of the Markov chains.

Note that you can directly edit the training text in the applet, and you can regenerate the Markov text multiple times using the same training text.

Submission

Submit your project using the submit project option associated with Eclipse. 

Academic Integrity

Please make sure you read the academic integrity section of the syllabus so you understand what is permissible in our programming projects.  We want to remind you that we check your project against other students' projects, and any case of academic dishonesty will be referred to the University's Office of Judicial Programs.