Roget's Thesaurus

Electronic Lexical Knowledge Base (ELKB)


description - download - license - publications - contact

Update! A newer version of this resource is available at The Open Roget's Project website.

This page presents the Electronic Lexical Knowledge Base (ELKB), software for accessing and exploring Roget's Thesaurus. It also provides solutions for various natural language processing tasks.

Keywords: Roget's thesaurus, lexical database, lexical chains, semantic distance.

DESCRIPTION

An Electronic Lexical Knowledge Base (ELKB) is a model for a lexical resource, implemented in software, for classifying, indexing, storing and retrieving words with their senses and the connections that exist between them. It relies on a rich data repository to do so. This model defines explicit semantic relationships between words and word groups. It maps out an automatic process for building an electronic lexicon. It is electronic not only because it is encoded in a digital format, but also because it is computer-usable, or tractable. This ELKB has been created from the machine-readable text files with the contents of the 1987 Penguin's Roget's Thesaurus*. It must preserve the information available in the printed Thesaurus while putting it in a tractable format.

*In this freely available version, the 1987 edition has been replaced by Roget's Thesaurus from 1911, obtained from Project Gutenberg.
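To make the idea of indexing words with their senses more concrete, here is a minimal sketch in Java. It is not the ELKB API: the class TinyLexicalIndex, its methods and the head numbers in the example are all hypothetical, chosen only to illustrate a word-to-heads index and one simple kind of connection between two words (heads they share).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative only: a toy word index, not the ELKB implementation. */
final class TinyLexicalIndex {

    // Maps a lower-cased word or phrase to the head numbers that list it.
    private final Map<String, TreeSet<Integer>> headsByWord = new HashMap<>();

    /** Records that the given word appears under the given head. */
    void add(String word, int headNumber) {
        headsByWord
            .computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
            .add(headNumber);
    }

    /** Returns the head numbers for a word in ascending order (empty if unknown). */
    List<Integer> headsOf(String word) {
        TreeSet<Integer> heads = headsByWord.get(word.toLowerCase(Locale.ROOT));
        return heads == null ? new ArrayList<>() : new ArrayList<>(heads);
    }

    /** Returns the heads under which both words are listed. */
    List<Integer> sharedHeads(String first, String second) {
        List<Integer> shared = headsOf(first);
        shared.retainAll(headsOf(second));
        return shared;
    }

    public static void main(String[] args) {
        TinyLexicalIndex index = new TinyLexicalIndex();
        // Invented head numbers, not taken from the real thesaurus.
        index.add("bank", 1);
        index.add("bank", 2);
        index.add("shore", 2);
        System.out.println(index.headsOf("bank"));              // [1, 2]
        System.out.println(index.sharedHeads("bank", "shore")); // [2]
    }
}
```

A real lexical knowledge base would also record the part of speech, paragraph and other positions within a head; this sketch keeps only the head number to stay short.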

Content

All scripts were originally developed as part of Mario Jarmasz's Master's thesis at the University of Ottawa, Canada.
  1. Roget's thesaurus (Peter Mark Roget, 1779 - 1869)

    The directory "roget_elkb" contains the electronic version of Roget's Thesaurus from 1911, obtained from Project Gutenberg. The files have been converted into a different format:
    • the index file (newIndex.txt),
    • the upper structure of the thesaurus with classes, sections, head groups and head names, in an XML-like format (rogetMap.rt),
    • 1044 head files (in the directory "heads") that correspond to single head entries in Roget.
    The same directory also contains the license for using this version of the thesaurus, obtained from the Gutenberg website.

    Read more about the Roget's Thesaurus on Wikipedia.

  2. ELKB package (ca.site.elkb)

    The ELKB (Electronic Lexical Knowledge Base) was created to access Roget's thesaurus, originally the 1987 Penguin edition, but here the freely available version described above.

  3. Application package (applications)

    This package collects practical applications that make use of Roget's thesaurus, for example a program for detecting lexical chains in a document, and scripts for measuring the semantic distance between two words by analysing Roget's structure. Here is a detailed description of the package.

  4. Script for testing and usage examples

    TestELKB lets you query the ELKB for information about a word or a phrase, and about two words or phrases and how they relate to each other.

  5. Resources
    • .exc files in "roget_elkb" are exception list files for the morphological processing rules used by WordNet 1.7.1.
    • AmBr.lst in "roget_elkb" is a list of 646 equivalent British and American spellings.
    • stop.txt (main folder) contains 980 stopwords.
    • The folder Resources contains sample input files and output files produced in the testing stage of ELKB.
  6. Documentation

    See the folder doc for the Javadoc of the ELKB.
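The idea of measuring semantic distance by analysing Roget's structure, mentioned for the applications package, can be sketched as counting edges between two positions in the hierarchy. The Java example below is a simplified illustration under stated assumptions, not the code shipped in the package: the class RogetPathDistance, the four-level paths and the placeholder labels are hypothetical, and the actual scripts may use more levels or a different scoring.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: edge counting in a hierarchy, not the ELKB API. */
final class RogetPathDistance {

    /**
     * Counts the edges between two senses in a tree. Each sense is given as
     * its path from the root, for example class, section, head group, head.
     */
    static int distance(List<String> pathA, List<String> pathB) {
        int shared = 0;
        int limit = Math.min(pathA.size(), pathB.size());
        // Walk down from the root while both paths agree.
        while (shared < limit && pathA.get(shared).equals(pathB.get(shared))) {
            shared++;
        }
        // Edges up from A to the deepest shared node, plus edges down to B.
        return (pathA.size() - shared) + (pathB.size() - shared);
    }

    public static void main(String[] args) {
        // Placeholder labels, not real Roget entries.
        List<String> a = Arrays.asList("class-1", "section-2", "group-5", "head-10");
        List<String> b = Arrays.asList("class-1", "section-2", "group-6", "head-12");
        List<String> c = Arrays.asList("class-3", "section-1", "group-9", "head-40");
        System.out.println(distance(a, a)); // 0
        System.out.println(distance(a, b)); // 4
        System.out.println(distance(a, c)); // 8
    }
}
```

A smaller value means the two senses sit closer together in the hierarchy. For words with several senses, one plausible choice is to take the minimum distance over all pairs of senses.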

 

DOWNLOAD

ELKB is distributed under the GNU General Public License. Before downloading, please submit your name and e-mail address. Click here to proceed...

Installation and Usage of the ELKB

  1. Extract the contents of the archive and move the folder roget_elkb into your home directory.
  2. In the ELKB directory, run java TestELKB and select option 1. This will create the index.
  3. In the same application, select 2 or 3 and follow the instructions to query the ELKB. See an example output.
  4. Now you can experiment with our applications.

Note: The Java files were compiled with Java 1.5. If you do not have this Java version, you might have to recompile the code:

$ELKB$ javac ca/site/elkb/*.java
$ELKB$ javac tools/*.java
$ELKB$ javac *.java
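If TestELKB cannot find its data, it may help to check that step 1 of the installation worked. The following Java sketch is not part of the ELKB distribution: the class ElkbInstallCheck is hypothetical, and it assumes that newIndex.txt, rogetMap.rt and the heads directory sit directly inside roget_elkb in your home directory, as the description above suggests.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: checks for the data files named on this page. */
final class ElkbInstallCheck {

    /** Returns the names of expected entries that are absent from rogetDir. */
    static List<String> missingEntries(Path rogetDir) {
        List<String> missing = new ArrayList<>();
        if (!Files.isRegularFile(rogetDir.resolve("newIndex.txt"))) {
            missing.add("newIndex.txt");
        }
        if (!Files.isRegularFile(rogetDir.resolve("rogetMap.rt"))) {
            missing.add("rogetMap.rt");
        }
        if (!Files.isDirectory(rogetDir.resolve("heads"))) {
            missing.add("heads");
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed location: the roget_elkb folder in the user's home directory.
        Path rogetDir = Paths.get(System.getProperty("user.home"), "roget_elkb");
        List<String> missing = missingEntries(rogetDir);
        if (missing.isEmpty()) {
            System.out.println("Found the expected entries in " + rogetDir);
        } else {
            System.out.println("Missing from " + rogetDir + ": " + missing);
        }
    }
}
```

The check only looks at names and file types; it does not validate the contents of the files.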

 

LICENSE & COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Copyright © 2006
Mario Jarmasz and Stan Szpakowicz
School of Information Technology and Engineering (SITE)
University of Ottawa, 800 King Edward Avenue
Ottawa, Ontario, Canada, K1N 6N5

and

Olena Medelyan
Department of Computer Science,
The University of Waikato
Private Bag 3105, Hamilton, New Zealand

 

PUBLICATIONS

Jarmasz, M. and Szpakowicz, S. (2003a). Roget's Thesaurus and Semantic Similarity. Proceedings of Conference on Recent Advances in Natural Language Processing (RANLP 2003), Borovets, Bulgaria, September, 212-219.

Jarmasz, M. and Szpakowicz, S. (2003b). Not As Easy As It Seems: Automating the Construction of Lexical Chains Using Roget's Thesaurus. Proceedings of the 16th Canadian Conference on Artificial Intelligence (AI 2003), Halifax, Canada, June, 544-549.

Jarmasz, M. and Szpakowicz, S. (2001a). The Design and Implementation of an Electronic Lexical Knowledge Base. Proceedings of the 14th Biennial Conference of the Canadian Society for Computational Studies of Intelligence (AI 2001), Ottawa, Canada, June, 325-333.

Jarmasz, M. and Szpakowicz, S. (2001b). Roget's Thesaurus: a Lexical Resource to Treasure. Proceedings of the NAACL WordNet and Other Lexical Resources workshop. Pittsburgh, June, 186-188.

CONTACT

Contact Olena Medelyan for more information.

