Keywords: Kea, keyphrase extraction, text mining, metadata, machine learning.
Many documents, particularly academic papers, are accompanied by a set of keywords that the author has chosen to describe the document. These keywords provide a kind of semantic metadata that is useful for a wide variety of purposes. Kea is an algorithm for extracting keyphrases from the text of a document. (We say keyphrases because not all keywords are words.)

In real life, the Kea is one of New Zealand's native parrots, famed for theft, destroying cars and cameras, forming street gangs, pecking sheep to death for their delicious kidney fat, and other cutesy antics. Richard photographed this Kea as it climbed onto his car.

The Kea algorithm is described in a paper on domain-specific keyphrase extraction. A complementary paper on practical automatic keyphrase extraction investigates the effect of changing various parameters on the algorithm's performance.

Kea has been re-implemented in Java, with some small changes to the original algorithm. The new version should run on any platform that supports Java. Like the old version, it is distributed under the GPL and available for download. For more information on the modified version, please consult the documentation that is included in the distribution. Kea-4.0 is an improved version that is designed for keyphrase extraction in agricultural domains (more information below).

Several interfaces and phrase extraction systems are based on the keyphrases extracted by Kea. Kniles is a system for automatically generating hypertext from plain text, using keyphrases to select link anchors and destinations. Phrasier is a browsing interface that is similar to Kniles, but more sophisticated because it is not restricted to HTML.

Kea-3.0 automatically extracts keyphrases from the full text of documents.
The set of all candidate phrases in a document is identified using rudimentary lexical processing, features are computed for each candidate, and machine learning is used to generate a classifier that determines which candidates should be assigned as keyphrases. Two features are used in the standard algorithm: TF.IDF and position of first occurrence. The TF.IDF feature requires a corpus of text from which document frequencies can be calculated; the machine learning phase requires a set of training documents with keyphrases assigned. The success of the procedure can be evaluated on a large test corpus, in terms of how many author-assigned keyphrases are correctly identified (a measure that is subject to some caveats). We have also conducted evaluations using human assessors to rate keyphrases.

Kea-4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture. It is based on Kea-3.0 but has a different candidate selection and term conflation strategy, new stemmers to choose from, and three new features. Here is a fuller description.

Candidate selection. Kea-4.0 only selects terms from the domain-specific thesaurus Agrovoc. To achieve the best possible matching and a high degree of conflation, n-grams are transformed into pseudo phrases by removing stopwords, stemming, and sorting the words alphabetically to canonicalize their order.

Stemmers. Additional stemmers can be selected: Porter, Paice, S-removal, and no stemming.

Features. New features include the length of the phrase in words, and node degree (how richly the term is connected in the thesaurus graph structure).

The tables below show the titles and keyphrases for three computer science technical reports. Keyphrases extracted by Kea-3.0 are listed, along with those assigned by the author. Phrases that both the author and Kea chose are in italics. Generally, the author phrases look a lot better. Kea occasionally chose simple phrases like cut and gauge that are not really appropriate.
Kea assigns the keyphrase garbage to the third paper, a classification the author is unlikely to agree with.
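The two standard Kea-3.0 features, TF.IDF and position of first occurrence, can be illustrated with a minimal Java sketch. The formulas here are one common formulation (relative phrase frequency multiplied by a base-2 inverse document frequency, and the fraction of the document that precedes the first match). They are assumptions made for illustration, as are the class and method names; the papers listed on this page give the definitions Kea actually uses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: not taken from the Kea distribution.
final class KeaFeatureSketch {

    // TF.IDF (assumed formulation): how often the phrase occurs in this
    // document, relative to document length, weighted by how rare the
    // phrase is across the corpus.
    static double tfIdf(int phraseCountInDoc, int docLength,
                        int docsContainingPhrase, int totalDocs) {
        double tf = (double) phraseCountInDoc / docLength;
        double idf = -Math.log((double) docsContainingPhrase / totalDocs) / Math.log(2);
        return tf * idf;
    }

    // Position of first occurrence (assumed formulation): number of words
    // before the first match, divided by the document length in words.
    // Returns -1 if the phrase does not occur in the document.
    static double firstOccurrence(List<String> docTokens, List<String> phraseTokens) {
        int index = Collections.indexOfSubList(docTokens, phraseTokens);
        return index < 0 ? -1.0 : (double) index / docTokens.size();
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "kea", "is", "a", "keyphrase", "extraction", "tool", "for", "documents");
        List<String> phrase = Arrays.asList("keyphrase", "extraction");

        // The phrase occurs once in an 8-word document and in 1 of 4 corpus documents.
        System.out.println("TF.IDF           = " + tfIdf(1, doc.size(), 1, 4));
        System.out.println("first occurrence = " + firstOccurrence(doc, phrase));
    }
}
```

In this formulation a phrase that appears early gets a small first-occurrence value, and a phrase that is frequent in the document but rare in the corpus gets a high TF.IDF score. The classifier would then be trained on such feature values for candidates in documents with known keyphrases.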
The tables below show the titles and keyphrases for three agricultural documents. Terms were assigned independently by 6 professional indexers; those that more than one indexer selected appear, along with terms extracted by Kea-4.0.
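The pseudo-phrase transformation that Kea-4.0 uses for candidate selection (remove stopwords, stem, sort the words alphabetically) can be sketched in a few lines of Java. Everything specific here is a stand-in chosen for illustration: the tiny stopword list, the crude trailing-"s" stemmer, and the class and method names are hypothetical and are not the ones shipped with Kea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: not taken from the Kea distribution.
final class PseudoPhraseSketch {

    // Hypothetical, deliberately tiny stopword list.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "and", "for", "in", "of", "the"));

    // Crude stand-in stemmer: strips a single trailing "s" from longer words.
    static String stem(String word) {
        boolean plural = word.length() > 3 && word.endsWith("s") && !word.endsWith("ss");
        return plural ? word.substring(0, word.length() - 1) : word;
    }

    // Lowercase, drop stopwords, stem each word, then sort alphabetically
    // so that word order no longer matters.
    static String pseudoPhrase(String phrase) {
        return Arrays.stream(phrase.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(word -> !word.isEmpty() && !STOPWORDS.contains(word))
                .map(PseudoPhraseSketch::stem)
                .sorted()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Both surface forms collapse to the same pseudo phrase.
        System.out.println(pseudoPhrase("management of forests"));
        System.out.println(pseudoPhrase("Forest management"));
    }
}
```

Because an n-gram from the document and a thesaurus term are reduced to the same canonical form, matching the two becomes a simple string lookup, which is what gives the high degree of conflation described above.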
Algorithm and initial evaluation.

Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G. (1999) "Domain-specific keyphrase extraction." Proc. Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA, pp. 668-673.

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. (1999) "KEA: Practical automatic keyphrase extraction." Proc. DL '99, pp. 254-256. (Poster presentation.)

Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. (2000) "KEA: Practical automatic keyphrase extraction." Working Paper 00/5, Department of Computer Science, The University of Waikato.

Subjective evaluation of keyphrases.

Jones S. and Paynter G.W. (2001) "Human evaluation of Kea, an automatic keyphrasing system." First ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, June 24-29, 2001, ACM Press, pp. 148-156.

Jones S. and Paynter G.W. (2002, in press) "An Evaluation of Document Keyphrase Sets." Journal of Digital Information.

Jones S. and Paynter G.W. (2002, in press) "Automatic extraction of document keyphrases for use in digital libraries: evaluation and applications." Journal of the American Society for Information Science and Technology (JASIST).

Applications.

Jones S. and Paynter G.W. (1999) "Topic-based browsing within a digital library using keyphrases." Proc. DL '99, pp. 114-121.

Gutwin C., Paynter G., Witten I.H., Nevill-Manning C. and Frank E. (1999) "Improving Browsing in Digital Libraries with Keyphrase Indexes." J. Decision Support Systems, vol. 27, nos 1-2, Nov. 1999, pp. 81-104.

Jones S., Lundy S. and Paynter G.W. (2002) "Interactive document summarisation using automatically extracted keyphrases." Hawai'i International Conference on System Sciences: Digital Documents: Understanding and Communication Track, Hawai'i, USA, January 7-11, 2002, IEEE-CS, pp. 101.

Kea is distributed under the GNU General Public License.
To install it on your system, use the jar utility included in every standard Java distribution to expand the archive file, and follow the instructions in the README file. (On Windows you can also use winzip to expand the archive.) Kea includes a cut-down version of the Weka machine learning workbench.
Kea 4.0 is an improved version of Kea that is designed for keyphrase extraction in agricultural domains:
KEA has also been integrated into the NLP workbench GATE (http://gate.ac.uk). Please send queries regarding the KEA plugin for GATE to the GATE support mailing list (http://gate.ac.uk/mail/index.html). There is an IKVM version of KEA 3.0 (for dotnet/C#) developed by Enrico Lu. It is available on his website: http://enricolu.myweb.hinet.net/. Recent changes:
The old version of Kea is still available for download. It is implemented in Perl and Java (and a little C) for Unix systems. It is not straightforward to install; you will probably need to know a little about Perl and Java. We strongly recommend that you read the README file before you attempt it, so that you know what's in store.
Contact Eibe Frank ([email protected]) or Olena Medelyan ([email protected]) for more information.