Kea

Keywords: Kea, keyphrase extraction, text mining, metadata, machine learning.


Many documents, particularly academic papers, are accompanied by a set of keywords that the author has chosen to describe the document. They provide a kind of semantic metadata that is useful for a wide variety of purposes. Kea is an algorithm for extracting keyphrases from the text of a document. (We say keyphrases because not all keywords are words.)

In real life, the Kea is one of New Zealand's native parrots, famed for theft, destroying cars and cameras, forming street gangs, pecking sheep to death for their delicious kidney fat, and other cutesy antics. Richard photographed this Kea as it climbed onto his car.

Online resources

The Kea algorithm is described in a paper on domain-specific keyphrase extraction. A complementary paper on practical automatic keyphrase extraction investigates the effect of changing various parameters on the algorithm's performance.

Kea has been re-implemented in Java, with some small changes to the original algorithm. The new version should run on any platform that supports Java. Like the old version, it is distributed under the GPL and available for download. For more information on the modified version, please consult the documentation that is included in the distribution. Kea-4.0 is an improved version that is designed for keyphrase extraction in agricultural domains (more information below).

Several interfaces and phrase extraction systems are based on the keyphrases extracted by Kea. Kniles is a system for automatically generating hypertext from plain text using keyphrases to select link anchors and destinations. Phrasier is a browsing interface that is similar to Kniles, but more sophisticated because it is not restricted to HTML.

How Kea works

Kea-3.0 automatically extracts keyphrases from the full text of documents. The set of all candidate phrases in a document is identified using rudimentary lexical processing, features are computed for each candidate, and machine learning is used to generate a classifier that determines which candidates should be assigned as keyphrases. Two features are used in the standard algorithm: TF.IDF and position of first occurrence. TF.IDF requires a corpus of text from which document frequencies can be calculated; the machine learning phase requires a set of training documents with keyphrases assigned.
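To make the two features concrete, here is a minimal sketch in Python (not code from the Kea distribution). It assumes one common formulation: TF.IDF as the phrase's relative frequency in the document multiplied by a smoothed log inverse document frequency, and first occurrence as the fraction of the document that precedes the phrase's first appearance. The function name, its parameters, and the +1 smoothing are all assumptions made for illustration; the classifier that consumes these features is not shown.

```python
import math


def candidate_features(phrase, doc_tokens, doc_freq, num_docs):
    """Return (tfidf, first_occurrence) for one candidate phrase.

    phrase     -- tuple of tokens, e.g. ("gauge", "fields")
    doc_tokens -- list of tokens of the document being processed
    doc_freq   -- number of corpus documents that contain the phrase
    num_docs   -- total number of documents in the corpus
    """
    n = len(phrase)
    # Start positions of every occurrence of the phrase in the document.
    positions = [
        i for i in range(len(doc_tokens) - n + 1)
        if tuple(doc_tokens[i:i + n]) == tuple(phrase)
    ]
    if not positions:
        return 0.0, 1.0
    tf = len(positions) / len(doc_tokens)
    # The +1 smoothing is an assumption; it keeps the logarithm finite
    # for phrases that never occur in the corpus.
    idf = math.log2((num_docs + 1) / (doc_freq + 1))
    first_occurrence = positions[0] / len(doc_tokens)
    return tf * idf, first_occurrence


tokens = "neural multigrid for gauge fields and gauge theories".split()
tfidf, first = candidate_features(("gauge",), tokens, doc_freq=1, num_docs=7)
print(tfidf, first)  # prints 0.5 0.375
```

A phrase that is frequent in the document, rare in the corpus, and introduced early gets a high TF.IDF and a low first-occurrence value, which is the kind of pattern a trained classifier can pick up on.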

The success of the procedure can be evaluated on a large test corpus, in terms of how many author-assigned keyphrases are correctly identified (a measure that is subject to some caveats). We have also conducted evaluations using human assessors to rate keyphrases.
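The counting step of such an evaluation can be sketched as follows. This is an illustration only: the function name is made up, and matching by exact string after lowercasing is an assumption (a real evaluation might compare stemmed forms instead). The two lists are the author and Kea-3.0 keyphrases for the second example report shown further down this page.

```python
def matched_keyphrases(extracted, author_assigned):
    """Return the extracted phrases that also appear in the author's list."""
    def normalize(phrase):
        # Lowercase and collapse whitespace before comparing.
        return " ".join(phrase.lower().split())

    author = {normalize(p) for p in author_assigned}
    found = {normalize(p) for p in extracted}
    return sorted(found & author)


author = ["disordered systems", "gauge fields", "multigrid",
          "neural multigrid", "neural networks"]
kea = ["disordered", "gauge", "gauge fields", "interpolation kernels",
       "length scale", "multigrid", "smooth"]

matches = matched_keyphrases(kea, author)
print(len(matches), matches)  # prints 2 ['gauge fields', 'multigrid']
```

Note that "disordered" earns no credit although the author chose "disordered systems". Near misses like this are one reason the measure comes with caveats.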

How Kea-4.0 works

Kea-4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture. It is based on Kea-3.0 but has a different candidate selection and term conflation strategy, new stemmers to choose from, and three new features. Here is a fuller description.

Candidate selection. Kea-4.0 only selects terms from the domain-specific thesaurus Agrovoc. To achieve the best possible matching and a high degree of conflation, n-grams are transformed into pseudo phrases by removing stopwords, stemming, and sorting the words alphabetically to canonicalize their order.
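The pseudo-phrase transformation can be sketched in a few lines of Python. Everything named here is hypothetical: the stopword list is a tiny stand-in, toy_stem is a placeholder for whichever stemmer is selected, and VOCABULARY is a three-term stand-in for the thesaurus rather than an excerpt from Agrovoc. Punctuation handling is omitted.

```python
STOPWORDS = frozenset({"a", "an", "and", "at", "for", "in", "of", "the", "to"})


def toy_stem(word):
    """Placeholder stemmer: strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def pseudo_phrase(ngram, stem=toy_stem, stopwords=STOPWORDS):
    """Remove stopwords, stem, and sort the remaining words alphabetically."""
    words = ngram.lower().split()
    stems = [stem(w) for w in words if w not in stopwords]
    return " ".join(sorted(stems))


# Stand-in vocabulary, indexed by the pseudo phrase of each term.
VOCABULARY = ["predatory birds", "bird control", "fish culture"]
INDEX = {pseudo_phrase(term): term for term in VOCABULARY}


def match_candidates(text, max_len=3):
    """Return vocabulary terms whose pseudo phrase matches an n-gram of text."""
    tokens = text.lower().split()
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            key = pseudo_phrase(" ".join(tokens[i:i + n]))
            if key in INDEX:
                found.add(INDEX[key])
    return sorted(found)


print(match_candidates("methods for the control of birds at fish farms"))
# prints ['bird control']
```

Because the words are sorted after stopword removal and stemming, "control of birds" and "bird control" collapse to the same pseudo phrase, which is what lets the n-gram match the vocabulary term despite the different surface form.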

Additional stemmers can be selected: Porter, Paice, S-removal, and no stemming.
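For readers unfamiliar with S-removal, the sketch below shows a widely cited three-rule formulation of such a stemmer, in which only the first applicable rule is used. It is an assumption that the S-removal stemmer shipped with Kea-4.0 follows exactly these rules; the function name is made up for this example.

```python
def s_removal_stem(word):
    """Conflate simple plural forms by rewriting or removing a final 's'."""
    w = word.lower()
    # Rule 1: "ies" becomes "y", except after "e" or "a".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" becomes "e", except in "aes", "ees", "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]
    # Rule 3: drop a final "s", except in "us" and "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for word in ["policies", "prices", "birds", "status"]:
    print(word, "->", s_removal_stem(word))
```

A stemmer this light conflates far fewer forms than Porter or Paice, which can be an advantage when the stems have to be matched against a controlled vocabulary.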

New features include the length of the phrase in words, and node degree (how richly the term is connected in the thesaurus graph structure).
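Both features are simple to compute once the thesaurus is available as a list of links between terms. The sketch below is an illustration under stated assumptions: the function names are made up, the links in RELATIONS are invented for the example (they are not taken from Agrovoc), and node degree is counted over the whole graph, although an implementation could equally restrict it to terms that occur in the same document.

```python
from collections import defaultdict


def node_degrees(relations):
    """Map each term to the number of distinct terms it is linked to."""
    neighbours = defaultdict(set)
    for a, b in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {term: len(linked) for term, linked in neighbours.items()}


def phrase_length(term):
    """Length of the phrase in words."""
    return len(term.split())


# Invented links, for illustration only.
RELATIONS = [
    ("aquaculture", "fish culture"),
    ("aquaculture", "fishery production"),
    ("birds", "predatory birds"),
]

degrees = node_degrees(RELATIONS)
print(degrees["aquaculture"], phrase_length("rural urban relations"))
# prints 2 3
```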

Kea-3.0 examples

The tables below show the titles and keyphrases for three computer science technical reports. Keyphrases extracted by Kea-3.0 are listed, along with those assigned by the author. Phrases that both the Author and Kea chose are in italics.

Generally, the author phrases look a lot better. Kea occasionally chose simple phrases like "cut" and "gauge" that are not really appropriate. Kea assigns the keyphrase "garbage" to the third paper, a classification the author is unlikely to agree with.

Protocols for secure, atomic transaction execution in electronic commerce

Author: anonymity, atomicity, auction, electronic commerce, privacy, real-time, security, transaction

Kea-3.0: atomicity, auction, customer, electronic commerce, intruder, merchant, protocol, security, third party, transaction

Neural multigrid for gauge theories and other disordered systems

Author: disordered systems, gauge fields, multigrid, neural multigrid, neural networks

Kea-3.0: disordered, gauge, gauge fields, interpolation kernels, length scale, multigrid, smooth

Proof nets, garbage, and computations

Author: cut-elimination, linear logic, proof nets, sharing graphs, typed lambda-calculus

Kea-3.0: cut, cut elimination, garbage, proof net, weakening

Kea-4.0 examples

The tables below show the titles and keyphrases for three agricultural documents. Terms were assigned independently by six professional indexers; those selected by more than one indexer are shown, along with the terms extracted by Kea-4.0.

The growing global obesity problem: Some policy options to address it

Indexers: developing countries, food consumption, overweight, taxes, prices, price policies, fiscal policies, feeding habits, nutritional requirements, diet, nutrition policies, food intake

Kea-4.0: developing countries, food consumption, overweight, taxes, price fixing, controlled prices, policies, body weight, saturated fats

Overview of techniques for reducing bird predation at aquaculture facilities

Indexers: aquaculture, damage, fencing, noise, scares, bird control, fishery production, predatory birds, control methods, fish culture, noxious birds

Kea-4.0: aquaculture, damage, fencing, noise, scares, birds, fishing operations, predators, ropes

Feeding Asian cities: Food production and processing issues

Indexers: food policies, food supply, towns, urbanization, urban agriculture, urban areas, food production, agricultural policies, rural urban relations, asia, state intervention, agricultural sector, suburban agriculture

Kea-4.0: food policies, food supply, towns, urbanization, urban agriculture, urban areas, food consumption, new products

Publications

Algorithm and initial evaluation.

Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G. (1999) "Domain-specific keyphrase extraction" Proc. Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA, pp. 668-673.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. (1999) "KEA: Practical automatic keyphrase extraction." Proc. DL '99, pp. 254-256. (Poster presentation.)

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. (2000) "KEA: Practical automatic keyphrase extraction." Working Paper 00/5, Department of Computer Science, The University of Waikato.

Subjective evaluation of keyphrases.

Jones, S. and Paynter, G.W. (2001). "Human evaluation of Kea, an automatic keyphrasing system". First ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, June 24-29, 2001, ACM Press, pp. 148-156.

Jones, S. and Paynter, G.W. (2002, in press) "An Evaluation of Document Keyphrase Sets" Journal of Digital Information

Jones, S. and Paynter, G.W. (2002, in press) "Automatic extraction of document keyphrases for use in digital libraries: evaluation and applications". Journal of the American Society for Information Science and Technology (JASIST)

Applications.

Jones S. and Paynter G. W. (1999) "Topic-based browsing within a digital library using keyphrases." Proc. DL'99, pp. 114-121.

Gutwin C., Paynter G., Witten I.H., Nevill-Manning C., and Frank E. (1999) "Improving Browsing in Digital Libraries with Keyphrase Indexes." J. Decision Support Systems, vol. 27, nos 1-2, Nov. 1999, pp. 81-104.

Jones, S., Lundy, S. and Paynter, G.W. (2002). "Interactive document summarisation using automatically extracted keyphrases". Hawai'i International Conference on System Sciences: Digital Documents: Understanding and Communication Track, Hawai'i, USA, January 7-11, 2002, IEEE-CS, p. 101.

Download Kea

Kea is distributed under the GNU General Public License. To install it on your system, use the jar utility included in every standard Java distribution to expand the archive file, and follow the instructions in the README file. (On Windows you can also use WinZip to expand the archive.) Kea includes a cut-down version of the Weka machine learning workbench.

Download Kea Version 3.0: Kea-3.0.zip (512 KB).

Kea 4.0 is an improved version of Kea that is designed for keyphrase extraction in agricultural domains:

Download Kea Version 4.0: Kea-4.0.zip (992 KB)
Download the Agrovoc model for Kea 4.0: FAO-20docs.zip (1184 KB)

The Agrovoc thesaurus is copyrighted by the UN Food and Agriculture Organization, and its use is free for non-profit applications.

KEA has also been integrated into the NLP workbench GATE (http://gate.ac.uk). Please send queries regarding the KEA plugin for GATE to the GATE support mailing list (http://gate.ac.uk/mail/index.html).

There is an IKVM version of KEA 3.0 (for .NET/C#) developed by Enrico Lu. It is available on his website: http://enricolu.myweb.hinet.net/.

Recent changes:

Version 4.0 - Kea for agricultural documents
Version 3.0 - Kea now also works for German documents
Version 2.0 - Kea is now fully Java-based
Version 1.1.4 - finally updated Kea-1.1.4-README.txt to cover building models, and added a count-lines.pl script to this end.
Version 1.1.3 - Moved Lynx command to script that checks for conditions that are likely to crash it.
Version 1.1.2 - Documentation, phrase length set at command-line.
Version 1.1.1 - Set output extension at command-line

The old version of Kea is still available for download. It is implemented in Perl and Java (and a little C) for Unix systems. It is not straightforward to install it; you will probably have to know a little about Perl and Java. We strongly recommend you read the README file before you attempt it, so that you know what's in store.

The Kea README file: Kea-1.1.4-README.txt
Download Kea Version 1.1.4: Kea-1.1.4.tar.gz (364 KB).

Here is a model for the old version of Kea that was trained on a collection of Computer Science Technical Reports and uses domain-specific keyphrase frequency information for better results.

Download the CSTR model: Kea-CSTR-model.tar.gz (1076 KB).

Contact Eibe Frank or Olena Medelyan for more information.
