Roget's Thesaurus - Electronic Lexical Knowledge Base ELKB

ELKB Applications

The applications package contains several scripts for exploring the ELKB (Electronic Lexical Knowledge Base), which represents Roget's Thesaurus (free version, 1911).

CONTENT

LexicalChain - detects lexical chains in a text
SemDist - measures semantic distance between two words or phrases
WordCluster - clusters words or phrases in a list according to their semantic similarity
WordPower - answers Reader's Digest WordPower type questions

All scripts were originally developed as part of Mario Jarmasz's Master's thesis at the University of Ottawa, Canada.

Before using the applications, make sure that ELKB is installed on your computer. The folder Resources contains files that can be used as input for each of these applications.

Lexical Chains

This program analyzes the words and phrases that appear in a document and how they are related to each other according to Roget's Thesaurus. It builds lexical chains, i.e. sequences of related words, that reflect the topics of the document.

Usage

java applications/LexicalChain -f <input_file> (-s <Elkb|HIndex>) (-d)

where
<input_file> is a text version of the document you want to process, e.g. Resources/train.txt
-s specifies the scoring function (either Elkb (default) or HIndex)
-d is a debugging option that turns on verbose output of the program.

HIndex

The Homogeneity Index (HIndex) scoring function was implemented by O. Medelyan according to the functions proposed in Barzilay & Elhadad (1997). The score of each lexical chain is:

score(chain) = allMembers*(1 - distinctMembers/allMembers)

Only the top chains, i.e. those that satisfy the following condition, are presented:
score(chain) > average(scores) + 2*StDev(scores).
Each member is listed only once in its chain, and the members are sorted by frequency. The leading member is the one with the highest frequency and can be seen as a keyword.
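
The scoring and selection steps above can be sketched in Java as follows. This is a minimal illustration, not the code shipped with ELKB: the class and method names are hypothetical, a chain is modelled as a plain list of word occurrences, allMembers is taken to be the number of occurrences and distinctMembers the number of different words, and the standard deviation is assumed to be the population one (divide by n).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and not part of ELKB.
class HIndexSketch {

    // score(chain) = allMembers * (1 - distinctMembers / allMembers)
    static double score(List<String> chain) {
        if (chain.isEmpty()) {
            return 0.0;
        }
        double allMembers = chain.size();
        double distinctMembers = chain.stream().distinct().count();
        return allMembers * (1.0 - distinctMembers / allMembers);
    }

    // Keeps the chains with score > average(scores) + 2 * StDev(scores).
    // Assumption: population standard deviation (divide by n).
    static List<List<String>> topChains(List<List<String>> chains) {
        List<List<String>> top = new ArrayList<>();
        int n = chains.size();
        if (n == 0) {
            return top;
        }
        double[] scores = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            scores[i] = score(chains.get(i));
            sum += scores[i];
        }
        double mean = sum / n;
        double squaredDeviations = 0.0;
        for (double s : scores) {
            squaredDeviations += (s - mean) * (s - mean);
        }
        double threshold = mean + 2.0 * Math.sqrt(squaredDeviations / n);
        for (int i = 0; i < n; i++) {
            if (scores[i] > threshold) {
                top.add(chains.get(i));
            }
        }
        return top;
    }

    // Distinct members, most frequent first; ties keep order of first appearance.
    static List<String> membersByFrequency(List<String> chain) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String member : chain) {
            counts.merge(member, 1, Integer::sum);
        }
        List<String> members = new ArrayList<>(counts.keySet());
        members.sort((a, b) -> counts.get(b) - counts.get(a));
        return members;
    }

    public static void main(String[] args) {
        List<String> chain =
                Arrays.asList("train", "rail", "train", "engine", "train", "rail");
        System.out.println("score = " + score(chain));
        System.out.println("members = " + membersByFrequency(chain));
    }
}
```

For the chain in main there are 6 occurrences and 3 distinct words, so the score is 6*(1 - 3/6) = 3, and "train" is the leading member. Note that the formula is algebraically the same as allMembers - distinctMembers, so chains with many repeated members score highest.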

Semantic Distance

SemDist measures the semantic distance between two words or phrases, on a scale from 4 (not similar) to 16 (very similar). There are two versions of the program:

1. SemDist requires an input file in which words or phrases are supplied in comma-separated pairs on one line. An example input file is MillerCharles.txt. Example output for the 1987 and 1911 versions of the Thesaurus can also be found in the Resources folder.

Usage:
java applications/SemDist <input file>

where <input file> is a file of word pairs as described above, e.g. Resources/MillerCharles.txt
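
To make the input format concrete, here is a small Java sketch that reads such a file. It is not the parser used by SemDist: the class and method names are hypothetical, and it assumes that each non-empty line holds exactly one pair of words or phrases separated by a single comma.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not part of ELKB.
class PairFileSketch {

    // Splits a line such as "painter,artist" into its two trimmed parts.
    // Assumption: exactly one comma per line.
    static String[] parsePair(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected one comma in: " + line);
        }
        String first = parts[0].trim();
        String second = parts[1].trim();
        if (first.isEmpty() || second.isEmpty()) {
            throw new IllegalArgumentException("Empty word or phrase in: " + line);
        }
        return new String[] {first, second};
    }

    // Parses every non-blank line into a pair.
    static List<String[]> parsePairs(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                pairs.add(parsePair(line));
            }
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of a pair file.
        for (String[] pair : parsePairs(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

Because the line is split on the comma only, phrases containing spaces are kept whole.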

2. SemDist2Words takes two words as input and computes their semantic similarity.

Usage:
java applications/SemDist2Words <word1> <word2>

where <word1> and <word2> are two valid words or phrases, e.g. "painter" and "artist". When entering a phrase that consists of more than one word, enclose it in quotation marks.

Word Clusters

WordCluster measures the semantic distance between all combinations of words and phrases in a list. It also clusters them according to their membership in Roget's Heads. A sample input file, radioactive_materials.txt, and the output files for the 1987 and 1911 versions are in the Resources folder.

Usage:
java applications/WordCluster <input file> <output file>
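
The two steps described above, comparing all combinations and grouping by Head, can be sketched in Java as follows. This is only an illustration under stated assumptions: the names are hypothetical, "all combinations" is read as all unordered pairs, and the Head lookup is passed in as a plain map rather than taken from the ELKB. The Head numbers in main are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical and not part of ELKB.
class WordClusterSketch {

    // Every unordered pair of items: n items give n*(n-1)/2 pairs.
    static List<String[]> allPairs(List<String> items) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                pairs.add(new String[] {items.get(i), items.get(j)});
            }
        }
        return pairs;
    }

    // Groups items by the Heads they belong to. An item found under several
    // Heads appears in several clusters; an item with no Head is left out.
    static Map<Integer, List<String>> clusterByHead(
            List<String> items, Map<String, List<Integer>> headsOf) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (String item : items) {
            List<Integer> heads =
                    headsOf.getOrDefault(item, Collections.<Integer>emptyList());
            for (Integer head : heads) {
                clusters.computeIfAbsent(head, h -> new ArrayList<>()).add(item);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("uranium", "radium", "lead");
        // Invented Head numbers, for illustration only.
        Map<String, List<Integer>> headsOf = new HashMap<>();
        headsOf.put("uranium", Arrays.asList(1));
        headsOf.put("radium", Arrays.asList(1, 2));
        headsOf.put("lead", Arrays.asList(2));
        System.out.println(allPairs(items).size() + " pairs to compare");
        System.out.println(clusterByHead(items, headsOf));
    }
}
```

In a real run, each pair would be scored with a semantic distance measure such as the one SemDist reports.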

Word Power Game

WordPower answers Reader's Digest WordPower-type questions, for example:

Which of these words can be traced back to the ancient Greek fantasize?

to imagine
to endear
to enlarge
to lie

Usage:
java applications/WordPower <input file>

A sample input file is Resources/rd_july2000.txt. You will also find the answers produced with two versions of Roget's Thesaurus: 1987 and 1911.

Contact

Contact Olena Medelyan for more information.
