Kea-4.0 - Description

Kea-4.0 is a keyphrase extraction algorithm for controlled indexing of documents from the agricultural domain. Compared to Kea-3.0, it has a different candidate selection and term conflation strategy, new stemmers to choose from, and three new features.

Controlled indexing is based on the domain-specific thesaurus Agrovoc, which contains 16,600 descriptors and 10,600 non-descriptors. It defines three semantic relations between descriptors: related term (RT), broader term (BT) and narrower term (NT). All non-descriptors are connected by preferential links to descriptors, which prevents the same concept from being indexed with different terms. The Agrovoc database is accessed through several text files that are stored in the directory AGROVOC.

Candidate selection works similarly to Kea-3.0, but with reference to Agrovoc. To achieve the best possible matching and a high degree of conflation, each n-gram is transformed into a pseudo phrase in three steps:

  1. remove all stopwords from the n-gram
  2. stem the remaining terms to their grammatical roots
  3. sort them into alphabetical order.

This maps similar phrases such as "algorithm efficiency", "the algorithms' efficiency", "an efficient algorithm" and even "these algorithms are very efficient" to the same pseudo phrase "algorithm effici", where "algorithm" and "effici" are the stems of the corresponding full forms.
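
The three steps above can be sketched as follows. This is only an illustration: the stopword list and the suffix rules in toy_stem are made up for this example and stand in for Kea's own stopword list and for a real stemmer such as Porter's. The names pseudo_phrase and toy_stem are hypothetical and do not come from the Kea source.

```python
import re

# Toy stopword list, assumed for this example only (not Kea's list).
STOPWORDS = {"a", "an", "the", "these", "are", "very", "of"}

# Toy suffix rules standing in for a real stemmer such as Porter's.
TOY_SUFFIXES = ("ency", "ent", "s")


def toy_stem(word):
    """Strip the first matching toy suffix. NOT a real stemming algorithm."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(ngram):
    """Turn an n-gram into a pseudo phrase: drop stopwords, stem, sort."""
    words = re.findall(r"[a-z]+", ngram.lower())          # case-fold and tokenize
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))                        # alphabetical order


if __name__ == "__main__":
    for text in ("algorithm efficiency", "the algorithms' efficiency",
                 "an efficient algorithm", "these algorithms are very efficient"):
        print(repr(text), "->", repr(pseudo_phrase(text)))
```

With these toy rules, all four example phrases end up as the same pseudo phrase, because word order, stopwords and inflectional endings no longer play a role.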

In the next step each pseudo phrase is matched against vocabulary terms, also represented as pseudo phrases. If they are the same, the n-gram is identified with the corresponding vocabulary term.
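
A minimal sketch of this matching step, assuming the vocabulary is held in a dictionary keyed by pseudo phrase. The vocabulary terms below are invented for the example, stemming is left out to keep the sketch short, and the function names are hypothetical rather than taken from Kea.

```python
# Toy stopword list, assumed for this example only.
STOPWORDS = {"a", "an", "the", "of", "in"}


def pseudo_phrase(text):
    """Simplified pseudo phrase: case-fold, drop stopwords, sort (no stemming)."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(words))


def build_index(vocabulary_terms):
    """Map the pseudo phrase of each vocabulary term to the term itself."""
    return {pseudo_phrase(term): term for term in vocabulary_terms}


def match_ngrams(ngrams, index):
    """Return {n-gram: vocabulary term} for n-grams whose pseudo phrase is known."""
    matches = {}
    for ngram in ngrams:
        term = index.get(pseudo_phrase(ngram))
        if term is not None:
            matches[ngram] = term
    return matches


if __name__ == "__main__":
    index = build_index(["Soil fertility", "Crop rotation"])  # made-up vocabulary
    print(match_ngrams(["fertility of the soil", "rotation", "the crop rotation"], index))
```

Converting the vocabulary once and looking up each n-gram by its pseudo phrase keeps the comparison a plain equality test, which is what the description above requires.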

For semantic term conflation, non-descriptors are replaced by their equivalent descriptors using the links in the thesaurus (these are called USE-FOR links in the Agrovoc thesaurus employed in this work).
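
The replacement can be pictured as a simple lookup. The sketch below assumes the preferential links are available as a dictionary from non-descriptor to descriptor. The example entries are invented, and dropping duplicates after the replacement is an assumption of this sketch, not something the description specifies.

```python
def conflate(terms, use_links):
    """Replace non-descriptors by their descriptors, keeping first-seen order.

    use_links maps a non-descriptor to its preferred descriptor.
    Descriptors (terms without an entry) are kept unchanged.
    """
    seen = set()
    result = []
    for term in terms:
        descriptor = use_links.get(term, term)
        if descriptor not in seen:      # assumed: each concept is kept only once
            seen.add(descriptor)
            result.append(descriptor)
    return result


if __name__ == "__main__":
    # Invented link for illustration only.
    links = {"farmland": "agricultural land"}
    print(conflate(["farmland", "soil", "agricultural land"], links))
```

In this example "farmland" and "agricultural land" collapse into the single descriptor "agricultural land", so the concept is indexed under one term only.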

As an optional extension, the candidate set is enriched with all terms that are related to the candidate terms, even though they may not correspond to pseudo phrases that appear in the document. For each candidate, the terms one link away from it in the thesaurus (RTs, BTs and NTs) are included. To use this feature, manually set the private variable m_RELused to "true" in
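
The enrichment of the candidate set can be sketched as follows, assuming the thesaurus links are stored as a nested dictionary of the form term -> relation -> list of terms. This layout, the function name and the example terms are assumptions made for illustration.

```python
def expand_candidates(candidates, links):
    """Add every term that is exactly one RT, BT or NT link away from a candidate.

    Only the original candidates are expanded, so terms two or more
    links away are not pulled in.
    """
    expanded = set(candidates)
    for term in candidates:
        relations = links.get(term, {})
        for relation in ("RT", "BT", "NT"):
            expanded.update(relations.get(relation, ()))
    return expanded


if __name__ == "__main__":
    # Invented thesaurus fragment for illustration only.
    links = {
        "wheat": {"BT": ["cereals"], "RT": ["flour"]},
        "cereals": {"BT": ["crops"]},
    }
    print(sorted(expand_candidates({"wheat"}, links)))
```

Here "cereals" and "flour" are added to the candidate "wheat", while "crops" is left out because it is two links away.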

Additional stemmers can be selected for both KEAModelBuilder and KEAKeyphraseExtractor with the "-t" option:

  • Porter Stemmer
  • Paice Stemmer
  • S-Removal (removing -s and -es endings, which is the first step in the Porter Stemmer)
  • No Stemming (only case-folding).
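
To give an idea of what S-Removal does, here is a sketch of the plural-removal rules from step 1a of the Porter algorithm (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It assumes that Kea's S-Removal option follows these rules, as the list above suggests. The actual Kea implementation may differ in detail, and the function name is hypothetical.

```python
def s_removal(word):
    """Apply the plural-removal rules of Porter step 1a to a single word."""
    word = word.lower()          # case-folding, as done by all stemmer options
    if word.endswith("sses"):
        return word[:-2]         # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]         # ponies -> poni
    if word.endswith("ss"):
        return word              # caress stays caress
    if word.endswith("s"):
        return word[:-1]         # cats -> cat
    return word


if __name__ == "__main__":
    for w in ("caresses", "ponies", "caress", "cats", "Algorithms"):
        print(w, "->", s_removal(w))
```

Compared to the full Porter Stemmer, this conflates only singular and plural forms, so "efficient" and "efficiency" would remain distinct.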

New features in Kea-4.0 are length of phrase in words, node degree and phrase appearance. The first two are used by default, while the last can be enabled manually by setting the private variable m_APfeature to "true" in Using this feature only makes sense when extended candidate selection with m_RELused = "true" is selected.

  1. Length of a phrase in words boosts the probability that candidate phrases with two or more words are keyphrases.
  2. Node degree reflects how richly the term is connected in the thesaurus graph structure. The "degree" of a thesaurus term is the number of semantic links that connect it to other terms. Kea-4.0 uses the number of links that connect the term to other candidate phrases as a numeric feature.
  3. Appearance is a binary attribute that reflects whether the pseudo phrase corresponding to a term actually appears in the document. With the optional extension of the candidate set mentioned above, some candidate terms may not appear in the document.
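
The three features can be computed along the following lines. This is a sketch under stated assumptions: links maps each term to the set of terms it is connected to by any RT, BT or NT link, and appearing holds the candidate terms whose pseudo phrase was found in the document. Both structures, the function name and the example data are invented for illustration. How Kea scales or discretizes these values is not covered here.

```python
def compute_features(term, candidates, links, appearing):
    """Return (length in words, node degree, appearance) for one candidate term."""
    # 1. Length of the phrase in words.
    length = len(term.split())
    # 2. Node degree: links to OTHER candidate terms only, not to the whole thesaurus.
    neighbours = links.get(term, set())
    degree = sum(1 for other in neighbours if other != term and other in candidates)
    # 3. Appearance: does the term's pseudo phrase occur in the document?
    appears = term in appearing
    return length, degree, appears


if __name__ == "__main__":
    # Invented data for illustration only.
    candidates = {"wheat", "cereals", "flour", "crop rotation"}
    links = {"wheat": {"cereals", "flour", "bread"}}
    appearing = {"wheat", "crop rotation"}
    for term in sorted(candidates):
        print(term, compute_features(term, candidates, links, appearing))
```

In this example "wheat" has node degree 2, because "bread" is linked to it in the thesaurus but is not a candidate, and "cereals" gets appearance False because it entered the candidate set only through the optional extension.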