@DefaultAnnotationForParameters(value=edu.umd.cs.findbugs.annotations.NonNull.class) @DefaultAnnotationForMethods(value=edu.umd.cs.findbugs.annotations.NonNull.class)

Package cs132.markov

This package provides classes for random generation of text using a Markov model.

See:
          Description

Class Summary
AllSherlockHolmes Provides main method to generate text using a third order Markov chain from all 12 Sherlock Holmes stories If a command line argument is supplied, writes output to that file.
AllSherlockHolmesByCharacters Provides main method to generate text using a fifth order character-based Markov chain from all 12 Sherlock Holmes stories If a command line argument is supplied, writes output to that file.
DenseBag<E> The DenseBag class implements a Set-like collection that allows duplicates (a lot of them).
MarkovText A MarkovText represents a Markov state transition table of a specific order that has been trained on some text.
PublicTests Some test cases distributed as part of the project.
SherlockHolmesMeetsHamlet Provides main method to generate text using a third order Markov chain from all 12 Sherlock Holmes stories and Hamlet.
Util This class provides some utility methods to create or train a Markov text generator from a Reader.
Words This class provides static utility methods for coverting text files to a list of words, and for printing a iterator of words in a nicely formatted manner.
 

Package cs132.markov Description

This package provides classes for random generation of text using a Markov model. It consists of a DenseBag implementation and a MarkovText class that uses DenseBags.

What you must implement

You must implement the methods of the DenseBag and MarkovText.

You will need to design and write additional classes in order to help you implement the functionality of those two named classes. In particular, you will need to write two classes that implement the Iterator interface; one to provide an Iterator for DenseBag, and one to implement the functionality of the MarkovText.markovIterator(java.util.Random) method. You may use inner classes if you wish.

You are provided with implementations of Words and Util.

Markov Chains

OK, we know about finite state machines. A (first-order) Markov chain is a system where states are choosen randomly, with the probably distribution for the next state being determined only by the current state. For example, you could model weather prediction as the following Markov chain:

In a second-order Markov chain, the probably of each state is based on the previous two states; in a third-order Markov chain, the probably of each state is based on the previous three states. More pointers on Markov Chains at:

Markov chain text generation

It is possible to do all sorts of interesting and deep mathematical and scientific things with Markov chains, but that isn't what we are going to do.

Instead, we are going to generate text based on a Markov chain. We are going to read in a corpus of text, and use that to build a Markov chain transition table. For example, using a third-order Markov transition table, we might find that the word sequence "what was the" occurs three times. Twice it is followed by "matter" and once by "cause". In generating new text, if we have been generating a sequence of words ending in "what was the", then we randomly choose which word to add next. One third of the time, we choose "cause", and two thirds of the time we choose "matter".

Lots of people have had this idea, including:

Although lots of people have done this before, we are going to take our own, unique OO approach to the problem.

You need to implement a DenseBag class and some static methods for working with Markov transition tables.

For the DenseBag class, you need to go beyond the functionality needed for this project, and provide a good, general-purpose implementation of a DenseBag. We will be testing features of your DenseBag implementation that are not needed for the Markov chain text generation.

More documentation in the classes themselves.

Markov chain order

Higher order Markov chains need to be trained on larger texts in order to get interesting results. If you try to build a third order Markov chain with just a thousand lines of text, you will get back significant passages that are exact copies from the original. If the training text is too small, then too much of the time there will be one possible word that could be generated next.

Sample Output

I've included data/SherlockHolmes1.txt, which is the text of "A Scandal in Bohemia", a Sherlock Holmes story.. I've also distributed data/SherlockHolmes1Generated.txt, which is text generated from data/SherlockHolmes1.txt by a second order word-level Markov chain.

In addition, data/SherlockHolmes2.txt ... data/SherlockHolmes12.txt are the text of other Sherlock Holmes stories. The text of all of these stories come from the Gutenberg project.

MarkovText class contain a main method that will do this generation (if you have the project implemented). You can, of course, modify it to process another file or use a different order Markov chain.