CMSC 330, Spring 2016
Organization of Programming Languages
Project 1 - WordNet
WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, a plant organ is a hypernym to plant root and plant root is a hypernym to carrot.
Getting Started
Download the following zip archive p1.zip. It should include the following files:
To download p1.zip on grace, execute
The WordNet DAG.
Your first task is to build the WordNet graph: each vertex v is an integer that represents a synset, and each directed edge v->w represents that
We now describe the two data files that you will use to create the WordNet digraph. The files are in CSV format: each line contains a sequence of fields, separated by commas.
Part 1: WordNet Properties
The first thing your program will do, of course, is read in the synsets and hypernyms files and build the synsets map and the hypernyms directed acyclic graph. You may assume that the synsets and hypernyms files are valid.
Once the synsets and hypernyms files are read in, your program will compute various properties of the words, according to the command (mode) it is given. Here are three simple properties you'll compute: whether a given word is a noun in the synsets, and the number of vertices and edges in your WordNet graph.
First, if we invoke your script with the mode isNoun, your script should output true if all words listed in the input file are nouns in the synsets. If any word in the given input file is not a noun, your script should output false. For example,
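The loading step can be sketched as follows. This is only a sketch under an assumption: the field layout is not spelled out here, so the code assumes the layout of the original Princeton data files, where a synsets line is `id,space-separated nouns,gloss` and a hypernyms line is `id,hypernym id,hypernym id,...`. The helper names `parse_synsets` and `parse_hypernyms` are hypothetical, and the sample lines are made up for illustration.

```ruby
# Hypothetical helpers; the field layout is an ASSUMPTION (see the note above).

# Builds { synset_id => [noun, ...] } from lines like "2,carrot cultivated_carrot,gloss".
def parse_synsets(lines)
  synsets = {}
  lines.each do |line|
    id, nouns, _gloss = line.chomp.split(",", 3)
    next if id.nil? || id.empty?          # skip blank lines
    synsets[id.to_i] = nouns.to_s.split(" ")
  end
  synsets
end

# Builds { synset_id => [hypernym_id, ...] } from lines like "2,1".
# A line holding only an id still registers that vertex, with no outgoing edges.
def parse_hypernyms(lines)
  graph = Hash.new { |h, k| h[k] = [] }
  lines.each do |line|
    id, *parents = line.chomp.split(",").map(&:to_i)
    next if id.nil?                       # skip blank lines
    graph[id].concat(parents)
  end
  graph
end

# Made-up sample data, not taken from the project inputs.
synsets = parse_synsets([
  "0,entity,the root",
  "1,plant_organ,part of a plant",
  "2,carrot cultivated_carrot,orange root"
])
graph = parse_hypernyms(["1,0", "2,1"])

p synsets[2]  # => ["carrot", "cultivated_carrot"]
p graph[2]    # => [1]
```

In the real script the line arrays would come from something like `File.readlines(ARGV[0])` and `File.readlines(ARGV[1])`.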
% ruby wordnet.rb inputs/synsets11.txt inputs/hypernyms11.txt isnoun inputs/isnoun1
true
% ruby wordnet.rb inputs/synsets11.txt inputs/hypernyms11.txt isnoun inputs/isnoun2
false
Second, if we invoke your script with the nouns mode, your script should output the number of nouns in the synsets. For example,
% ruby wordnet.rb inputs/synsets14.txt inputs/hypernyms14.txt nouns
9
Finally, if we invoke your script with the edges mode, your script should print the number of edges in the WordNet graph you built from the hypernyms. For example,
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt edges
11
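The three properties above could be computed roughly as shown below. This is a sketch, not a reference solution. It assumes the synsets are held as a hash `{ id => [noun, ...] }` and the graph as `{ id => [hypernym_id, ...] }`, and it assumes that the nouns mode counts distinct nouns, which the text does not state explicitly. The method names are hypothetical.

```ruby
require "set"

# ASSUMED shapes: synsets = { id => [noun, ...] }, graph = { id => [hypernym_id, ...] }

# All distinct nouns that appear in any synset.
def noun_set(synsets)
  Set.new(synsets.values.flatten)
end

# True only if every word in the list is a noun in some synset.
def all_nouns?(synsets, words)
  nouns = noun_set(synsets)
  words.all? { |w| nouns.include?(w) }
end

# Number of distinct nouns (an assumption about what the nouns mode counts).
def noun_count(synsets)
  noun_set(synsets).size
end

# Number of directed edges: one per (synset, hypernym) pair.
def edge_count(graph)
  graph.values.map(&:size).inject(0, :+)
end

# Made-up sample data, not taken from the project inputs.
synsets = { 0 => ["entity"], 1 => ["plant_organ"], 2 => ["carrot", "cultivated_carrot"] }
graph   = { 0 => [], 1 => [0], 2 => [1] }

puts all_nouns?(synsets, ["carrot", "entity"])  # true
puts all_nouns?(synsets, ["carrot", "table"])   # false
puts noun_count(synsets)                        # 4
puts edge_count(graph)                          # 2
```

Building the noun set once keeps each membership test cheap, which matters when the input file lists many words.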
Part 2: Length, Ancestor, and Root
In this part, you will calculate the shortest ancestral path between nouns. An ancestral path between two vertices
Implement the following functions:
1. length(v, w): returns the length of the shortest ancestral path between v and w, or -1 if no such path exists. If we invoke your script with the length mode, your script should output the length of the shortest ancestral path between the two sets of synset ids given in the input file. For example,
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt length inputs/input1
2. ancestor(v, w): returns a common ancestor (synset id) of v and w that participates in a shortest ancestral path, or -1 if no such path exists. If we invoke your script with the ancestor mode, your script should output the synset id of the shortest common ancestor of the two sets of synset ids given in the input file. For example,
% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt ancestor inputs/input1
3. root(v, w): returns a closest common ancestor (noun) of the nouns v and w, or the empty string if no such ancestor exists. If we invoke your script with the root mode, your script should output the closest common ancestor (one or more nouns) of the two nouns given in the input file. For example,
% ruby wordnet.rb inputs/synsets11.txt inputs/hypernyms11.txt root inputs/root1
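One possible way to compute length and ancestor is a breadth-first search upward from each set of vertices. The sketch below assumes the definition used in the original Princeton assignment: an ancestral path between v and w is a directed path from v to a common ancestor x together with a directed path from w to x. Unlike the signatures above, these hypothetical helpers take the graph (`{ id => [hypernym_id, ...] }`) as an explicit first argument, and v and w are arrays of synset ids. The sample graph is made up.

```ruby
# Breadth-first search along hypernym edges from a set of start vertices.
# Returns { vertex => fewest edges from any start vertex }.
def ancestor_distances(graph, starts)
  dist = {}
  queue = []
  starts.each do |s|
    next if dist.key?(s)
    dist[s] = 0
    queue << s
  end
  until queue.empty?
    v = queue.shift
    graph.fetch(v, []).each do |parent|
      next if dist.key?(parent)           # already reached by a shorter path
      dist[parent] = dist[v] + 1
      queue << parent
    end
  end
  dist
end

# Returns [length, ancestor], or [-1, -1] when no common ancestor exists.
def shortest_ancestral_path(graph, vs, ws)
  dv = ancestor_distances(graph, vs)
  dw = ancestor_distances(graph, ws)
  common = dv.keys & dw.keys
  return [-1, -1] if common.empty?
  best = common.min_by { |x| dv[x] + dw[x] }
  [dv[best] + dw[best], best]
end

def length(graph, vs, ws)
  shortest_ancestral_path(graph, vs, ws)[0]
end

def ancestor(graph, vs, ws)
  shortest_ancestral_path(graph, vs, ws)[1]
end

# Made-up sample graph: 3 and 4 point to 1; 5 points to 2; 1 and 2 point to 0.
graph = { 0 => [], 1 => [0], 2 => [0], 3 => [1], 4 => [1], 5 => [2], 6 => [] }

puts length(graph, [3], [4])    # 2
puts ancestor(graph, [3], [4])  # 1
puts length(graph, [3], [5])    # 4
puts length(graph, [6], [3])    # -1
```

Because each search visits every vertex and edge at most once, the cost is linear in the size of the graph per query. A root function could reuse `ancestor` after mapping each noun to the ids of the synsets that contain it.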
Part 3: Measuring the semantic relatedness of two nouns
Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two US presidents) are more related than George Bush and chimpanzee (two primates). However, most of us would not consider George Bush and Eric Arthur Blair to be related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.
We estimate the semantic relatedness of two nouns distance(A, B) as follows:
If either A or B is not a WordNet noun, the distance is infinity.
Otherwise, the distance is the minimum length of any ancestral path between any synset v of A and any synset w of B.
Outcast detection. Given a list of nouns $A_1, A_2, \ldots, A_n$, which noun is the least related to the others? To identify the outcast, compute the sum of the squares of the distance between each noun and every other one:
$$(d_i)^2 = (\mathrm{dist}(A_i, A_1))^2 + (\mathrm{dist}(A_i, A_2))^2 + \cdots + (\mathrm{dist}(A_i, A_n))^2$$
and return the noun $A_t$ for which $d_t$ is maximum.
Implement a function outcast(nouns) that prints the outcast noun in the input file. For example,
% ruby wordnet.rb inputs/synsets.txt inputs/hypernyms.txt outcast inputs/outcast1
Among the nouns "horse zebra cat bear table" in the input file outcast1.txt, "table" is the outcast.
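The outcast rule can be sketched as below. To keep the example self-contained, the hypothetical `outcast` here takes the distance function as a second argument and returns the noun rather than printing it. The distances in the lookup table are invented toy numbers for illustration only; they are not real WordNet distances. In the real script the distance would be computed from the graph as described above.

```ruby
# dist is any callable taking two nouns and returning a number
# (Float::INFINITY when either one is not a WordNet noun).
# Returns the noun whose sum of squared distances to the others is largest.
def outcast(nouns, dist)
  nouns.max_by do |a|
    nouns.map { |b| dist.call(a, b)**2 }.inject(0, :+)
  end
end

# TOY distances, invented for this example only.
toy = {
  %w[horse zebra] => 2, %w[horse cat] => 4, %w[horse table] => 9,
  %w[zebra cat]   => 4, %w[zebra table] => 9, %w[cat table] => 8
}
toy_dist = lambda do |a, b|
  return 0 if a == b
  toy[[a, b]] || toy[[b, a]] || Float::INFINITY
end

# Sums of squares with the toy numbers:
#   horse 101, zebra 101, cat 96, table 226
puts outcast(%w[horse zebra cat table], toy_dist)  # table
```

Since maximizing the squared sum also maximizes its square root, there is no need to take a square root before comparing.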
Hints and Tips
You should submit a file wordnet.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking:
ruby wordnet.rb <synset file> <hypernym file> <mode> <input file>
where <mode> describes what the tool should do (see above), and <input file> names the file containing the input data.
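The command line could be unpacked along these lines. This is a sketch with a hypothetical helper name, `parse_args`. It treats the input file as optional because the nouns and edges examples pass only three arguments, and it lowercases the mode as an assumption, since the text writes isNoun while the example commands use isnoun.

```ruby
# Splits the argument list into named parts.
# The input file may be nil (the nouns and edges examples pass only three arguments).
def parse_args(argv)
  synset_file, hypernym_file, mode, input_file = argv
  if mode.nil?
    raise ArgumentError,
          "usage: ruby wordnet.rb <synset file> <hypernym file> <mode> [<input file>]"
  end
  {
    synsets:   synset_file,
    hypernyms: hypernym_file,
    mode:      mode.downcase,   # ASSUMPTION: accept both isNoun and isnoun
    input:     input_file
  }
end

# Example with a literal argument list; the real script would pass ARGV.
args = parse_args(%w[inputs/synsets1.txt inputs/hypernyms1.txt edges])
puts args[:mode]          # edges
puts args[:input].inspect # nil
```

A `case args[:mode]` statement can then dispatch to the code for each mode, with no extra output printed along the way.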
Be sure to follow the project description exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.
You can submit your project in two ways:
The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.
The original project was created by Alina Ene and Kevin Wayne at Princeton University. This course project is copyright of Dr. Anwar Mamat. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited without the express consent of the author.