CMSC 330, fall 2016
Organization of Programming Languages
Project 1 - WordNet
WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, a plant organ is a hypernym to plant root and plant root is a hypernym to carrot.
Getting StartedDownload the following zip archive p1.zip. It should include the following files:
To download p1.zip on grace, execute
The WordNet DAG.
Your first task is to build the WordNet graph: each vertex v is a non-negative integer that represents a synset, and each directed edge v->w
We now describe the two types of data files that you will use to create the WordNet digraph. The descriptions lay out the structures of valid input files.
Part 1: Graph Construction and Invalid Input Files
You may not assume that synsets and hypernyms files are validly-structured; any files that do not exactly follow the format described above are considered invalid. First, your program will read in the synsets file. If it is invalid, then your program should print invalid synsets followed by each invalid line in the order that they appear in; then, the program should promptly exit without doing anything else (the program will not scan the hypernyms file). In the following example, both an invalid synsets file and hypernyms file are provided, but since synsets are read first, the program will exit before scanning the hypernyms:
% ruby wordnet.rb inputs/synsets2.txt inputs/hypernyms2.txt isnoun inputs/isnoun1 invalid synsets ids: 1 synset: b id: 5 synset: g id: 6synset: eNext, your program will read in the hypernyms file. If it is invalid, then your program should print invalid hypernyms followed by each invalid line in the order that they appear in; then, the program should promptly exit without doing anything else. In the following example, the synsets file is valid, but the hypernyms file is invalid:
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms2.txt isnoun inputs/isnoun1 invalid hypernyms from: z to: 2 from:3 to: 5 to: 7 from: 6If both files are valid, then your program will create a WordNet graph with synset nodes and hypernyms edges; how you choose to represent the graph is left to your own discretion. Choose an efficient representation; eventually you will have to work with synsets.txt and hypernyms.txt, which are large files. Use the other input files as examples, as they are much smaller and easier to work with. You may assume that validly-structured input files will always describe valid DAGs. Consequently, each DAG will have a common root, which will be important in Part 3. For example, the WordNet subgraph above has root "event".
When checking for validity, you may need to use the split or chomp method from the String class in order to handle the trailing newline characters (\n)
Part 2: WordNet Properties
Once the synsets and hypernyms files are read in, your program will compute various properties of the words, according to the command (mode) it is given. Here are three simple properties you'll compute:
1. isnoun: If we invoke your script with the mode isnoun, your script will take in an input file that contains a list of words. The provided code will process the input file and pass an array of words into isnoun. The method should return true if all of the words in the array are contained in the synsets, and false otherwise. For example,
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt isnoun inputs/isnoun1 true %ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt isnoun inputs/isnoun2 false2. nouns: Returns the number of nouns in the synsets. The count should also include all instances of duplicate nouns. In the following example, there are 9 nouns, because each instance of "e" is counted:
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt nouns 93. edges: Returns the number of edges in the WordNet graph you built from the hypernyms. For example,
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt edges 9
Part 3: Length, Ancestor, and Root
In this part, you will calculate the shortest ancestral path between nouns. An ancestral path between two IDs
From the definition of a SAP, it is possible for there to be multiple SAPs between IDs v and w. The SAPs may either lead to a single common ancestor or
to different common ancestors. In the latter case,
Implement the following functions:
1. length(v, w): Let
length([1,2],[3,4]) = minimum of length(1,3); length(1,4); length(2,3); and length(2,4)
Returns -1 (which represents Infinity) instead if no SAP exists between any ID of
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt length inputs/length1 3 %ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt length inputs/length4 -12. ancestor(v, w): Let
%ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt ancestor inputs/ancestor1 3 %ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt ancestor inputs/ancestor4 -13. root(v,w): Let
%ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt root inputs/root2 g h
Part 4: Outcast Detection
Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two US presidents) are more related than are George Bush and chimpanzee (two primates). However, not most of us agree that George Bush and Eric Arthur Blair are related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.
We estimate the semantic relatedness of nouns A and B, denoted dist(A, B), as follows: If either A or B is not a WordNet noun, the distance is Infinity. Otherwise, the distance is the minimum length of the SAPs between any ID associated with A and any ID associated with B.Given a list of nouns A1, A2, ..., An, which noun is the least related to the others? To identify the outcast, for each noun compute the sum of the squares of the distance between the noun and every other one. For instance, the sum for noun Ai (denoted as di) is calculated as follows:
di = (dist(Ai, A1))2 + (dist(Ai, A2))2 + ... + (dist(Ai, An))2. The outcast(s) is At for which dt is the maximum.
outcast(nouns): Given an array of nouns, the method should return an array of the outcast(s) (any order, including duplicates). For this part, you may assume that all of the nouns in the array are contained in the synsets. The input files may contain duplicate instances of nouns; the handling of duplicates will be discussed below. For example,
%ruby wordnet.rb inputs/synsets.txt inputs/hypernyms.txt outcast inputs/outcast3 table
Among the nouns "horse zebra cat bear table" in the input file outcast3.txt, "table" is the outcast. What if instead, the input file was "horse zebra cat bear table table"? Notice that the formula above does not rely on the uniqueness of nouns. In the original input file, the calculation for "zebra" is as follows:
dist_zebra = (dist(zebra, horse))2 + (dist(zebra, zebra))2 + ... + (dist(zebra, table))2
In the modified input file, the calculation for "zebra" is different:
dist_zebra = (dist(zebra, horse))2 + (dist(zebra, zebra))2 + ... + (dist(zebra, table))2 + (dist(zebra, table))2
In contrast to the result of using the original input file, now "zebra" is the outcast.
Hints and Tips
You should submit a file wordnet.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking:ruby wordnet.rb <synset file> <hypernym file> <mode> <input file>
where <mode> describes what the tool should do (see above), and <input> names the file containing the input data.
Be sure to follow the project description exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.You can submit your project in two ways:
The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.
Original project was created by Alina Ene and Kevin Wayne at Princeton University. This course project is copyright of Dr. Anwar Mamat. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited without the express consent of the author.