CMSC 330, Spring 2016

Organization of Programming Languages

Project 1 - WordNet

Due Monday, Feb 15 Tuesday Feb 16, 2016 11:59pm

Errata

  • Updated on Feb 2 at 3:00pm: .submit file is added to pz.zip
  • Updated on Feb 1 at 9:00pm: An empty folder "student_outputs" is added.
  • Updated on Jan 31 at 8:00pm: test.rb is added to p1.zip.
  • Updated on Jan 31 at 8:00pm: hypernyms2.txt changed from space-delimited to comma-delimited.
  • Updated on Jan 30 at 11:00pm: Example given for "nouns" was wrong.

Introduction

WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, a plant organ is a hypernym to plant root and plant root is a hypernym to carrot.

Getting Started

Download the following zip archive p1.zip. It should include the following files:

To download p1.zip on grace, execute

wget www.cs.umd.edu/class/spring2016/cmsc330/projects/p1/p1.zip

The WordNet DAG. Your first task is to build the WordNet graph: each vertex v is an integer that represents a synset, and each directed edge v->w represents that w is a hypernym of v. The graph is directed and acyclic (a DAG), though not necessarily a tree since each synset can have several hypernyms. A small subgraph of the WordNet graph is illustrated below.

We now describe the two data files that you will use to create the WordNet digraph. The files are in CSV format: each line contains a sequence of fields, separated by commas.

  • List of noun synsets. The synsets file ( For example synsets.txt) lists all the (noun) synsets in WordNet. The first field is the synset id (an integer), the second field is the synset, and the third field is its dictionary definition (or gloss). For example, the following line
    45,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire 
    means that the synonym set whose elements are AND_circuit and AND_gate has an id number of 45, and it's gloss is a circuit in a computer that fires only when all of its inputs fire. The individual nouns that comprise a synset are separated by spaces (and a synset element is not permitted to contain a space). We do not use the glasses in this project. Therefore, the some test input files do not have glosses.

  • List of hypernyms. The file hypernyms.txt contains the hypernym relationships: The first field is a synset id; subsequent fields are the id numbers of the synset's hypernyms. For example, the following line
    171,22798,57458 

    means that the the synset 171 ("Actifed") has 2 hypernyms: 22798 ("antihistamine") and 57458 ("nasal_decongestant"), representing that Actifed is both an antihistamine and a nasal decongestant. The synsets are obtained from the corresponding lines in the file synsets.txt.

    171,Actifed,trade name for a drug containing an antihistamine and a decongestant...
    22798,antihistamine,a medicine used to treat allergies...
    57458,nasal_decongestant,a decongestant that provides temporary relief of nasal...
          
  • A noun can appear in more than one synset. A noun will appear once for each meaning that the noun has. For example, here are all of the entries in synsets.txt that include the noun word.
    37559,discussion give-and-take word,an exchange of views on some topic; "we had a good discussion"; "we had a word or two about it" 
    50266,news intelligence tidings word,new information about specific and timely events; "they awaited news of the outcome" 
    60429,parole word word_of_honor,a promise; "he gave his word" 
    60430,password watchword word parole countersign,a secret word or phrase known only to a restricted group; "he forgot the password" 
    80883,word,a unit of language that native speakers can identify; "words are the blocks from which sentences are made"; "he hardly said ten words all morning" 
    80884,word,a brief statement; "he didn't say a word about it" 
    80885,word,a verbal command for action; "when I give the word  charge!" 
    80886,word,a word is a string of bits stored in computer memory; "large computers use words up to 64 bits long" 

Part 1: WordNet Properties

The first thing your program will do, of course, is to read in the synsets ans hypernyms files and build synsets map and hypernyms directed acyclic graph. You may assume that synsets and hypernyms files are valid.

Once the synsets and hypernyms files are read in, your program will compute various properties of the words, according to the command (mode) it is given. Here are three simple properties you'll compute: is a given word a noun in the synsets, the number of vertices and edges in your wornet graph.

First, if we invoke your script with the mode isNoun, your script should output true if all words listed in the input file are nouns in the synsets. If any word in the given input file is not a noun, you script should output false. For example,

% ruby wordnet.rb inputs/synsets11.txt inputs/hypernyms11.txt isnoun inputs/isnoun1
true

%ruby wordnet.rb inputs/synsets11.txt inputs/hypernyms11.txt isnoun inputs/isnoun2
false

Second, if we invoke your script with the nouns mode, your script should output the number of nouns in the synsets. For example,

% ruby wordnet.rb inputs/synsets14.txt inputs/hypernyms14.txt nouns
9

Finally, if we invoke your script with the edges mode, your script should print the number of edges in the WordNet graph you built from the hypernyms. For example,

% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt edges
11

Part 2: Length, Ancestor, and Root

In this part, you will calculate the shortest ancestral path between nouns. An ancestral path between two vertices v and w in a digraph is a directed path from v to a common ancestor x, together with a directed path from w to the same ancestor x. A shortest ancestral path is an ancestral path of minimum total length. For example, in the graph at left, the shortest ancestral path between 3 and 11 has length 4 (with common ancestor 1). In the graph at right, one ancestral path between 1 and 5 has length 4 (with common ancestor 5), but the shortest ancestral path has length 2 (with common ancestor 0).

Implement following functions

1. length(v, w): returns length of shortest ancestral path between v and w. It returns -1 if such path does not exist. If we invoke your script with the length mode, your script should output the length of the shortest ancestral path between the two sets of synset ids given in the input file. For example,

% ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt length inputs/input1
4 %ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt length inputs/input2
2

2. ancestor(v, w): returns a common ancestor (synset id) of v and w that participates in a shortest ancestral path. It returns -1 if such path does not exist. If we invoke your script with the ancestor mode, your script should output the synset id of the shortest common ancestor between the two sets of synset ids given in the input file. For example,

%ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt ancestor inputs/input1
1 %ruby wordnet.rb inputs/synsets1.txt inputs/hypernyms1.txt ancestor inputs/input3
-1

3. root(v,w): returns a closest common ancestor (noun) of v and w. v and w are nouns. It returns empty string if such path does not exist. If we invoke your script with the root mode, your script should output the closest coomon ancestor (one or more nouns) between the two nouns given in the input file. For example,

 %ruby wordnet.rb inputs/synsets11.txt inputs/hypernyms11.txt root inputs/root1
foo

Part 3: Measuring the semantic relatedness of two nouns

Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two US presidents) are more related than are George Bush and chimpanzee (two primates). However, not most of us agree that George Bush and Eric Arthur Blair are related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.

We estimate the semantic relatedness of two nouns distance(A, B) as follows:

If either A or B is not a WordNet noun, the distance is infinity.

Otherwise, the distance is the minimum length of any ancestral path between any synset v of A and any synset w of B.

Outcast detection. Given a list of nouns A1, A2, ..., An, which noun is the least related to the others? To identify the outcast, compute the sum of the squares of the distance between each noun and every other one:

(di)2 = (dist(Ai, A1))2 + (dist(Ai, A2))2 + ... + (dist(Ai, An))2 and return the noun At for which dt is maximum.

Implement a function outcast(nouns) that prints the outcast noun in the input file. For exmample

  %ruby wordnet.rb inputs/synsets.txt inputs/hypernyms.txt outcast inputs/outcast1
table

Among the nouns "horse zebra cat bear table" in the input file outcast1.txt, "table" is the outcast.

Hints and Tips

  • To run public tests, execute "ruby tests/test_file_name". For example

    ruby tests/public_length1.rb

    You can also diff your output with the expected output in outputs folder.

  • This project is non-trivial, in part because you will probably be writing in Ruby for the first time, so be sure to start right away, and come to office hours if you get stuck.
  • Follow good program development practices: Test each part of your program as you develop it. Start developing a simplified solution and then add features as you are sure that earlier parts work. Test early and often, and re-run your tests as you add new features to be sure you didn't break anything.
  • Before you get too far, review the Ruby class reference, and look for classes and methods that might be helpful. For example, the Array and Hash classes will come in handy. Finding the right class might save you a lot of time and make your program easier to develop.
  • If you write methods that should return a true or false value, remember that a Ruby 0 is not false.
  • Ruby has an integrated debugger, which can be invoked by running Ruby with the -rdebug option. The debugger's p command may be helpful for viewing the values of variables and data structures. The var local command prints all of the local variables at the current point of exclusion. The chapter When Trouble Strikes of The Pragmatic Programmer's Guide discusses the debugger in more detail.
  • To thoroughly debug your program, you will need to construct test cases of your own, based on the project description. If you need help with this, please come to TA office hours.
  • Remember to save your work frequently---a power failure, network failure, or problem with a phone connection could cost many hours of lost work. For the same reason, submit your project often. You can retrieve previously-submitted versions of your program from the submit server should disaster strike.
  • Be sure you have read and understand the project grading policies in the course syllabus. Do this well in advance of the project due date.

Project Submission

You should submit a file wordnet.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking:

ruby wordnet.rb <synset file> <hypernym file> <mode> <input file>

where <mode> describes what the tool should do (see above), and <input> names the file containing the input data.

Be sure to follow the project description exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.

You can submit your project in two ways:
  • Submit your wordnet.rb file directly to the submit server by clicking on the submit link in the column "web submission".

    Next, use the submit dialog to submit your input.rb file directly.

    Select your file using the "Browse" button, then press the "Submit project!" button. You do not need to put it in a Jar or Zip file.

  • Submit directly by executing a Java program on a computer with Java and network access. Included in p1.zip are the following files:

    The files should be in the directory containing your project. From there you can either execute submit.rb, or type the following command directly:

    java -jar submit.jar

    The first time you submit this way you will be asked to enter your directory ID and password. All files in the directory (and its subdirectories) will then be put in a jar file and submitted to the submit server. If your submission is successful you will see the message:

    Successful submission # received for project 1

Academic Integrity

The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.

Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.

Copyright Notice

Original project was created by Alina Ene and Kevin Wayne at Princeton University. This course project is copyright of Dr. Anwar Mamat. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited without the express consent of the author.

 

Web Accessibility