Kea

Download

Kea is distributed under the GNU General Public License. The current version 5.0 allows free as well as controlled indexing. It uses the latest version of the Weka machine learning workbench.

KEA-5.0

easy to install and use, direct from your code or from the command line
free or controlled indexing, with any vocabulary in text or SKOS format
latest libraries, including Jena-2.4 and Weka-3.5.5
easily applicable to new languages and domains
distributed with sample vocabularies in 3 languages (en, es, fr)
contains sample documents in 3 languages for creating and testing models

Download:

Download Kea from its Google Code project page. It includes source code, required libraries, test data and documentation.

Also consider using Maui, an algorithm for topic indexing, which can be used for the same tasks as Kea, but offers additional features. Maui also allows indexing using Wikipedia as a controlled vocabulary.

Examples of controlled vocabularies that can be used with Kea (and Maui)

Agricultural thesaurus Agrovoc: skos, text.
Agrovoc is being developed by the United Nations Food and Agriculture Organization (FAO). Our SKOS version is a reduced version of the original, containing only English terms.
Medical Subject Headings (MeSH): skos.
Medical Subject Headings are maintained by the US National Library of Medicine and are used in the Medline online library. We suggest using the SKOS version of the MeSH thesaurus provided by the Vrije Universiteit Amsterdam.
High Energy Physics (HEP) vocabulary: skos.
The HEP vocabulary is being developed by the DESY library and is used for indexing physics documents at CERN. The SKOS version is available from Alberto Pepe's document classification project.
Alcohol and drugs thesaurus (AOD thesaurus): text.
The AOD thesaurus was developed by the NIAAA (National Institute on Alcohol Abuse and Alcoholism). We have converted the original thesaurus into a format that Kea can read.
More domain-specific controlled vocabularies in SKOS format

Documentation

Free or Controlled Indexing?

In free indexing, keyphrases are significant terms that appear in the document. Any phrase in the document is a potential keyphrase. The advantage of free indexing is that it can be applied to any document. The disadvantages are that the extracted phrases are of lower quality than those from controlled indexing, and that the indexing is not consistent across documents.
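The idea that any phrase in the document is a potential keyphrase can be illustrated with a minimal sketch. This is not Kea's implementation: the stopword list, the function names, and the ranking by raw frequency are assumptions made for illustration only, with frequency standing in for a trained model.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list shipped with Kea).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "is", "are", "to", "with"}


def candidate_phrases(text, max_len=3):
    """Count word sequences of 1..max_len words that neither start nor end with a stopword."""
    counts = Counter()
    # Split at punctuation so that candidates never cross a phrase boundary.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", chunk)
        for i in range(len(words)):
            for n in range(1, max_len + 1):
                gram = words[i:i + n]
                if len(gram) < n:
                    break  # ran past the end of the chunk
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def free_index(text, top_k=5):
    """Return the top_k candidates, ranked by raw frequency (a stand-in for a learned score)."""
    counts = candidate_phrases(text)
    # Ties are broken by preferring longer phrases, then alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], -len(kv[0].split()), kv[0]))
    return [phrase for phrase, _ in ranked[:top_k]]
```

Because every candidate comes straight from the document's own wording, two documents on the same topic can end up with different keyphrases, which is the inconsistency mentioned above.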

In controlled indexing, keyphrases are chosen from a controlled vocabulary (a dictionary, thesaurus, or list of terms). Its advantage is that all documents are indexed consistently, regardless of their wording. For example, two documents, one about "laptops" and another about "notebooks", would be indexed with the same term: the preferred term that the controlled vocabulary uses to describe this concept.
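The "laptops" and "notebooks" example can be sketched as a lookup from document phrases to preferred terms. The vocabulary below is hypothetical, and the crude plural stripping stands in for a real stemmer; this illustrates the principle and is not how Kea itself is implemented.

```python
import re

# Hypothetical vocabulary: every label (preferred or not) maps to its preferred term.
VOCABULARY = {
    "laptop": "Portable computers",
    "notebook": "Portable computers",
    "portable computer": "Portable computers",
    "keyboard": "Keyboards",
}


def normalize(word):
    """Strip a trailing 's' from longer words (a naive stand-in for stemming)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def controlled_index(text, vocabulary=VOCABULARY, max_len=3):
    """Return the preferred terms whose labels occur in the text, sorted alphabetically."""
    words = [normalize(w) for w in re.findall(r"[a-z]+", text.lower())]
    found = set()
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n > len(words):
                break
            phrase = " ".join(words[i:i + n])
            if phrase in vocabulary:
                found.add(vocabulary[phrase])
    return sorted(found)
```

With this vocabulary, a document mentioning "laptops" and another mentioning "notebooks" both receive the term "Portable computers", while a document containing no vocabulary label receives no terms at all.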

Older Versions

Kea-4.1 (ZIP, 6.6 MB) -- controlled indexing only

Readme Kea 4.1 - Installation and usage instructions
Javadoc Kea 4.1
Models for Kea-4.1: Agricultural (50 docs), Medicine (20 docs), Physics (29 docs).

Kea-4.0 (ZIP, 1 MB) -- controlled indexing for agricultural documents only.

Includes the older version of the Agrovoc thesaurus in text form
Models for KEA-4.0: Agricultural (20 docs)

Kea-3.0 (ZIP, 512 KB) -- free indexing only.

It is based on the original version, which has been re-implemented in Java. Version 3.0 additionally allows indexing German documents. Implementing further languages is straightforward.

The oldest version of Kea is still available for download. It is implemented in Perl and Java (and a little C) for Unix systems. It is not straightforward to install; you will probably need to know a little about Perl and Java. We strongly recommend that you read the README file before you attempt it, so that you know what's in store.

Download: Kea-1.1.4.tar.gz (364 KB).
Readme: Kea-1.1.4-README.txt

Here is a model for the old version of Kea that was trained on a collection of Computer Science Technical Reports and uses domain-specific keyphrase frequency information for better results.

Download the CSTR model: Kea-CSTR-model.tar.gz (1076 KB).

Other Resources

KEA has also been integrated into the NLP workbench GATE (http://gate.ac.uk). Please send queries regarding the KEA plugin for GATE to the GATE support mailing list (http://gate.ac.uk/mail/index.html).

There is an IKVM version of KEA 3.0 (for .NET/C#) developed by Enrico Lu. It is available on his website: http://enricolu.myweb.hinet.net/.

History

Version 5.0 - Kea now combines controlled and free indexing. Works with the latest version of Weka.
Version 4.1 - Kea now works with any controlled vocabulary in SKOS format.
Version 4.0 - Kea for agricultural documents
Version 3.0 - Kea now also works for German documents
Version 2.0 - Kea is now fully Java-based
Version 1.1.4 - finally updated Kea-1.1.4-README.txt to cover building models, and added a count-lines.pl script to this end.
Version 1.1.3 - Moved Lynx command to script that checks for conditions that are likely to crash it.
Version 1.1.2 - Documentation, phrase length set at command-line.
Version 1.1.1 - Set output extension at command-line