Roget's Thesaurus
Electronic Lexical Knowledge Base (ELKB)
Update! A newer version of this
resource is available at The
Open Roget's Project website.
This page presents the Electronic Lexical Knowledge Base ELKB, software
for accessing and exploring the Roget's thesaurus. It also provides solutions
for various natural language processing tasks.
Keywords: Roget's thesaurus, lexical database, lexical chains,
semantic distance.
DESCRIPTION
An
Electronic Lexical Knowledge Base (ELKB) is a model for a lexical
resource, implemented in software, for classifying, indexing, storing
and retrieving words with their senses and the connections that exist
between them. It relies on a rich data repository to do so. This model
defines explicit semantic relationships between words and word groups.
It maps out an automatic process for building an electronic lexicon. It
is electronic not merely because it is encoded in a digital format, but
because it is computer-usable, or tractable. This ELKB has been
created from machine-readable text files containing the 1987 Penguin
edition of Roget's Thesaurus*. It must preserve the information available
in the printed Thesaurus while putting it in a tractable format.
*In this freely available version, the 1987 edition has been
replaced by the 1911 Roget's Thesaurus, obtained from Project
Gutenberg.
Content
All scripts were originally developed as part of Mario
Jarmasz's Master's thesis at the University
of Ottawa, Canada.
- Roget's thesaurus
The directory "roget_elkb" contains the electronic version of
the 1911 Roget's Thesaurus, obtained from Project
Gutenberg. The files have been converted into a different format:
- the index file (newIndex.txt),
- the upper structure of the thesaurus with classes, sections, head
groups and head names, in an XML-like format (rogetMap.rt),
- 1044 head files (in the directory "heads") that correspond
to single head entries in Roget.
The same directory also contains the license for use of this version
of the thesaurus, obtained from the Gutenberg website.
Read more about the Roget's
Thesaurus on Wikipedia.
- ELKB package (ca.site.elkb)
The ELKB (Electronic Lexical Knowledge Base) was created to access
Roget's thesaurus: originally the 1987 Penguin edition, but here
the freely available version described above.
- Application package (applications)
This package collects practical applications that make use of Roget's
thesaurus, for example a program for detecting lexical chains in
a document and scripts for measuring the semantic distance between two
words by analysing Roget's structure. Here is a detailed
description of the package.
- Script for testing and usage examples
TestELKB lets you query the ELKB for information about a word
or a phrase, or about two words or phrases and how they relate to each
other.
- Resources
- .exc files in "roget_elkb" are exception list files
for the morphological processing rules used by WordNet
1.7.1.
- AmBr.lst in "roget_elkb" is a list of 646 equivalent
British and American spellings
- stop.txt (main folder) contains 980 stopwords
- The folder Resources contains sample input files and output files
produced in the testing stage of ELKB.
- Documentation
See the folder doc for the Javadoc of ELKB.
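The application package above mentions measuring semantic distance between two words by analysing Roget's structure. As a rough illustration of the idea only, the following minimal sketch treats each word sense as a path through a hierarchy (class, section, head group, head) and counts the edges separating two paths. The class name RogetDistanceSketch, the method distance, the four-level depth and the placeholder labels are all assumptions made for this example; they are not taken from the ELKB code, whose actual measure may differ.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: edge-counting distance in a Roget-style hierarchy. */
final class RogetDistanceSketch {

    /**
     * Returns the number of edges on the path between two leaves.
     * Each argument lists the labels from the top of the hierarchy down
     * to the leaf; both paths are assumed to have the same depth.
     */
    static int distance(List<String> a, List<String> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("paths must have the same depth");
        }
        // Count how many leading levels the two paths share.
        int shared = 0;
        while (shared < a.size() && a.get(shared).equals(b.get(shared))) {
            shared++;
        }
        // Climb from one leaf to the lowest shared level, then descend to the other.
        return 2 * (a.size() - shared);
    }

    public static void main(String[] args) {
        // Placeholder labels, not real Roget entries.
        List<String> first = Arrays.asList("class A", "section A1", "head group A1a", "head 1");
        List<String> sibling = Arrays.asList("class A", "section A1", "head group A1a", "head 2");
        List<String> far = Arrays.asList("class B", "section B1", "head group B1a", "head 9");

        System.out.println(distance(first, first));   // prints 0
        System.out.println(distance(first, sibling)); // prints 2
        System.out.println(distance(first, far));     // prints 8
    }
}
```

Smaller values mean the two entries sit closer together in the hierarchy; a real implementation would first look up every location of each word in the index and take the minimum over all pairs.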
DOWNLOAD
ELKB is distributed under the GNU General
Public License. Before downloading, please submit your name and e-mail
address. Click here to proceed...
Installation and Usage of the ELKB
- Extract the content of the archive and move the folder roget_elkb into
your home directory.
- In the ELKB directory, run java TestELKB and select
option 1. This will create the index.
- In the same application select 2 or 3, and follow the
instructions to query the ELKB. See an example
output.
- Now you can experiment with our applications.
Note: The Java files were compiled with Java 1.5. If you don't
have this Java version, you might have to recompile the code:
$ELKB$ javac ca/site/elkb/*.java
$ELKB$ javac tools/*.java
$ELKB$ javac *.java
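Before running TestELKB, it can help to confirm that roget_elkb really ended up in the home directory with the pieces described under Content. The following is a minimal sketch of such a check. It assumes that newIndex.txt, rogetMap.rt and the heads directory sit directly under roget_elkb; the class ElkbLayoutCheck and its method missingEntries are hypothetical names introduced for this example and are not part of the ELKB distribution.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which expected entries are absent from roget_elkb. */
final class ElkbLayoutCheck {

    /** Returns the names of expected entries that are missing under rogetDir. */
    static List<String> missingEntries(Path rogetDir) {
        List<String> missing = new ArrayList<String>();
        // Assumed layout: two files and one directory at the top level.
        if (!Files.isRegularFile(rogetDir.resolve("newIndex.txt"))) {
            missing.add("newIndex.txt");
        }
        if (!Files.isRegularFile(rogetDir.resolve("rogetMap.rt"))) {
            missing.add("rogetMap.rt");
        }
        if (!Files.isDirectory(rogetDir.resolve("heads"))) {
            missing.add("heads");
        }
        return missing;
    }

    public static void main(String[] args) {
        // The installation steps place roget_elkb in the user's home directory.
        Path dir = Paths.get(System.getProperty("user.home"), "roget_elkb");
        List<String> missing = missingEntries(dir);
        if (missing.isEmpty()) {
            System.out.println("Layout looks complete: " + dir);
        } else {
            System.out.println("Missing under " + dir + ": " + missing);
        }
    }
}
```

This sketch uses java.nio.file, so it needs Java 7 or later even though the ELKB sources themselves target Java 1.5.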
LICENSE & COPYRIGHT
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
675 Mass Ave, Cambridge, MA 02139, USA.
Copyright © 2006
Mario Jarmasz and Stan
Szpakowicz
School of Information Technology
and Engineering (SITE)
University of Ottawa, 800 King Edward Avenue
Ottawa, Ontario, Canada, K1N 6N5
and
Olena Medelyan
Department of Computer Science,
The University of Waikato
Private Bag 3105, Hamilton, New Zealand
PUBLICATIONS
Jarmasz, M. and Szpakowicz, S. (2003a).
Roget's Thesaurus and Semantic Similarity. Proceedings of Conference
on Recent Advances in Natural Language Processing (RANLP 2003), Borovets,
Bulgaria, September, 212-219.
Jarmasz, M. and Szpakowicz, S. (2003b).
Not As Easy As It Seems: Automating the Construction of Lexical Chains
Using Roget's Thesaurus. Proceedings of the 16th Canadian Conference
on Artificial Intelligence (AI 2003), Halifax, Canada, June, 544-549.
Jarmasz, M. and Szpakowicz, S. (2001a).
The Design and Implementation of an Electronic Lexical Knowledge Base.
Proceedings of the 14th Biennial Conference of the Canadian Society
for Computational Studies of Intelligence (AI 2001), Ottawa, Canada,
June, 325-333.
Jarmasz, M. and Szpakowicz, S. (2001b).
Roget's Thesaurus: a Lexical Resource to Treasure. Proceedings
of the NAACL WordNet and Other Lexical Resources workshop. Pittsburgh,
June, 186-188.
CONTACT
Contact Olena Medelyan
for more information.