|
Project #2 (Random Text Generation) |
CMSC 132 |
|
Due Date: Mon Mar 3, 6:00 pm |
Object-Oriented Programming II |
|
Type of Homework: Closed |
Spring 2008 |
Objective
The objective of this project to practice using the Java Collections Framework
This homework is considered a closed homework. Make sure you read the open/closed policy before continuing working on this project.
Project Clarifications
Any clarifications or corrections associated with this project will be available at:
http://www.cs.umd.edu/class/spring2008/cmsc132/projects/p2/clarifications.html
Overview
For this project you must implement the methods of the
DenseBag and
MarkovText classes. These methods will allow us to generate
random text. You will need to design and write additional classes in
order to help you implement the functionality of those two named classes. In
particular, you will need to write two classes that implement the
Iterator interface; one to provide an Iterator for DenseBag,
and one to implement the functionality of the
MarkovText.markovIterator(java.util.Random) method. You may use
inner classes if you wish.
Documentation for this project can be found at
javadoc documentation
Your project will be graded as follows:
Tests (90 %)
Public JUnit tests (30 %)
Release JUnit tests (60 %)
Student Tests (5 %)
Style (5 %)
OK, we know about finite state machines. A first-order Markov chain is a system where states are chosen randomly, with the probability distribution for the next state being determined only by the current state. For example, you could model weather prediction as the following Markov chain:
As part of this project, you will implement a "Dense Bag". A bag is like a Set, but with repeats allowed. A dense bag is a bag where we expect to get lots of repeats. In this project, bags will contain words, so a typical "dense bag" might contain:
[170 occurrences of "cow", 200 occurrences of "dog", 1717 occurrences of "cat"]
Note that in the example above you will not actually store 170 separate copies of the word "cow" in your DenseBag object (nor will you even store 170 separate references to the word "cow") -- rather you will create just one copy, and associate the number 170 with this copy to keep track of the number of occurrences of "cow" that are supposed to be in the bag.
Read the javadoc for the DenseBag class to see what methods you must implement and what they must do. Note that for the DenseBag class, you need to go beyond the functionality needed for this project, and provide a good, general-purpose implementation of a DenseBag. We will be testing features of your DenseBag implementation that are not needed for the Markov chain text generation.
We are going to read in a corpus of text (also referred to as "training sequence"), and use that to build a Markov chain transition table. For example, using a third-order Markov transition table, we might find that the word sequence "what was the" occurs three times. Twice it is followed by "matter" and once by "cause". In generating new text, if we have a sequence of words ending in "what was the", then we randomly choose which word to add next. One third of the time, we choose "cause", and two thirds of the time we choose "matter".
Lots of people have had this idea, including:
Although lots of people have done this before, we are going to take our own, unique OO approach to the problem.
Let's suppose we are going to generate random text based on the following input stream (also referred to as "training sequence").
This is a nice sentence. This is another sentence. This is another sentence.
From the list of words described above, the program will build a "transition table", which will be a map. For a markov chain of Order-n, the map will use Lists of n words as the keys. The values that the map assigns to the keys will be DenseBag objects containing words.
Here's how it works. Suppose we are generating a Markov Transition Table of Order-2. Let's consider a typical sequence of two words taken from the input in the example above. The sequence of two words [This, is] occurs in three places in the input. In two places, "This is" is followed by the word "another". In the other spot, "this is" is followed by "a". So in the transition table there would be an entry mapping the List [This, is] to a DenseBag containing [2 occurrences of "another", 1 occurrence of "a"].
[""] = {"a" = 1}
["a"] = {"a" = 1, "b" = 1}
["b"] = {"" = 1}
For an Order-2 Markov Transition Table trained on the sequence [ "a",
"a", "a", "b" ], we would generate the map:
["", ""] = {"a" = 1}
["", "a"] = {"a" = 1}
["a", "a"] = {"a" = 1, "b" = 1}
["a", "b"] = {"" = 1}
For an Order-n Markov Transition Table, the first map entry will have a list of n empty strings mapped to the first word in the training sequence. The map list entry representing the end of the training sequence, will map to a dense bag that contains an empty string (regardless of the order used). Keep in mind that other strings besides the empty string can be associated with this list entry (see Order-2 Markov Chain example below).
Example
Below is the Transition Table for the Order-2 Markov Chain generated by the following text:
This is a nice sentence. This is another sentence. This is another sentence.
["",""] = {"This" = 1}
["", This] = {"is" = 1}
[This, is] = {"a" = 1, "another" = 2}
[is, a] = {"nice" = 1}
[a, nice] = {"sentence." = 1}
[nice, sentence.] = {"This" = 1}
[sentence., This] = {"is" = 2}
[is, another] = {"sentence." = 2}
[another, sentence.] = {"This" = 1, "" = 1}
Below is the transition table you would get if you were using Order-3.
["","",""] = {"This" = 1}
["","", This] = {"is" = 1}
["", This, is] = {"a" = 1}
[This, is, a] = {"nice" = 1}
[is, a, nice] = {"sentence." = 1}
[a, nice, sentence.] = {"This" = 1}
[nice, sentence., This] = {"is" = 1}
[sentence., This, is] = {"another" = 2}
[This, is, another] = {"sentence." = 2}
[is, another, sentence.] = {"This" = 1, "" = 1}
[another, sentence., This] = {"is" = 1}
Your program should be able to generate a Markov Transition Table from a list of words using any order specified (1, 2, 3, 17, whatever...)
Once your transition table is complete, you can use it to generate random text. Let's do an example using the Order-2 Transition Table described above.
Additional Information
From the choose( ) method project documentation:
Your choose() method will be tested to ensure that:
To generate random numbers, use the nextInt(int n) method of the class Random, which will return a random integer value between 0 and n-1 with uniform probability.
To randomly choose between events with different probabilities, you can assign events to different ranges of numbers corresponding to the probabilities, then select the event corresponding to the random number generated within this range.
Example:
From the hashcode( ) method project documentation:
Your hashcode() method will be tested to ensure that:
In general, a hashcode( ) method should also be significantly less expensive to compute than the equals( ) method (otherwise why even use a hashcode?), but we will not be testing for efficiency.
The hashcode( ) method should thus compute the hashcode from the same attributes of an object that are used by the equals( ) method, but do it in a simpler/superficial way.
Example hashcode( ) implementations:
Simply converting an object to a String and using the string's hashcode( ) would not work if two equal objects generate different strings, and would probably be inefficient in any case since the toString() method would likely do more work than equals( ).
Markov Text Applet
You project includes a Java applet to help test your Markov text generator. The applet provides a GUI for entering/editing training data (or downloading training data directly from an URL), then displays the generated text. The GUI also allows you to specify the order of the Markov chains. Some training data you may use can be found at:
Honors Section Requirement
Students in the honor section must implement remove for the DenseBag iterator.
Submission
Submit your project using the submit project option associated with Eclipse.
Academic Integrity
Please make sure you read the academic integrity section of the syllabus so you understand what is permissible in our programming projects. We want to remind you that we check your project against other students' projects and any case of academic dishonesty will be referred to the University's Office of Judicial Program.