=====================================================================
====== README ======

KEA 4.1
27 October 2006

http://www.nzdl.org/Kea/

Java Programs for Automatic Keyphrase Extraction
with Controlled Vocabularies

Copyright (C) 2000-2006 Eibe Frank, Olena Medelyan
email: {eibe,olena}@cs.waikato.ac.nz

=====================================================================

Contents:
---------

1. Installation

2. Getting started
- Building a keyphrase extraction model
- Extracting keyphrases
- Important comment

3. History and Further Information

4. Copyright

----------------------------------------------------------------------

NOTE:
-----

This distribution includes a cut-down version of WEKA, the GPL'ed
machine learning workbench available from
http://www.cs.waikato.ac.nz/ml/weka.

The lib directory contains 5 libraries from the Jena-2.4 package:
http://jena.sourceforge.net/

----------------------------------------------------------------------

1. Installation:
----------------

KEA is implemented as a set of Java classes (located in the same
directory as this README file). To run KEA you need to tell the Java
Virtual Machine where to look for KEA classes. One possible way of
doing this is to add the directory that contains this README file to
the CLASSPATH environment variable that is used by the Java Virtual
Machine. Under Linux you would do the following:

a) Set KEAHOME to be the directory which contains this README.

b) Add $KEAHOME to your CLASSPATH environment variable.

c) Add $KEAHOME/lib/*.jar to your CLASSPATH environment variable.

d) Put your vocabulary file(s) into $KEAHOME/VOCABULARIES/.
It needs to be a vocabulary either in the SKOS format, or in the
text format required by KEA. See below (2.d) for details.
If you don't have a vocabulary get one from KEA's website:
http://www.nzdl.org/Kea/download.html#vocabularies

The on-line documentation (generated from the source code) is located
in the doc directory.
You might want to do the following to have the documentation handy in
your web browser:

e) Bookmark $KEAHOME/doc/packages.html in your web browser.

Kea has been developed in Java 1.5, on Eclipse under Linux.
It has also been tested on Windows XP.

----------------------------------------------------------------------

2. Getting started:
-------------------

Building a keyphrase extraction model
=====================================

To extract keyphrases for new documents, you first need to build a
KEA keyphrase extraction model from a set of documents (preferably
from the same domain) for which you have author-assigned keyphrases.
To this end you have to go through the following steps:

a) Create a directory, called, for example, "training_documents",
containing the documents that you want to use for training the
keyphrase extractor.

b) Rename the document files in that directory so that they end with
the suffix ".txt". (If your documents are PDFs, use pdftotext on
Linux to convert them into text format.)

c) Delete the author-assigned keyphrases from those documents and put
them into separate ".key" files. For example, if your document file
is called doc1.txt, move the keyphrases into a new file called
"doc1.key". It is important that you put each keyphrase on a
separate line in the .key file!

d) Now you need to provide a controlled vocabulary. A list of
controlled vocabularies is available on
http://www.nzdl.org/Kea/download.html#vocabularies.
You can use any other thesaurus in SKOS format. The number of SKOS
vocabularies is constantly increasing; the latest ones are listed on
http://esw.w3.org/topic/SkosDev/DataZone.
Make sure the SKOS vocabulary has the extension .rdf.

You can also use a thesaurus in a plain text format, provided it is
stored in the same manner as the Agrovoc thesaurus text files:
.en, .use and .rel

- .en is the main index with TermID and Term separated by a blank,
one pair per line.
- .use is the list of non-descriptors with their corresponding
descriptors in the format NonDescriptorID DescriptorID, separated
by a tab symbol, one pair per line.
- .rel is the list of semantic relations between terms in the format
TermID followed by the tab symbol, followed by the IDs of
semantically related terms, separated by blank symbols.

e) Build the keyphrase extraction model by running the
KEAModelBuilder:

java KEAModelBuilder -l <name_of_directory> -m <name_of_model> -v <name_of_vocabulary> -f <vocabulary_format>

This will use the documents in <name_of_directory> to build a
keyphrase extraction model and save it in <name_of_model>.
If you use "-v agrovoc -f skos", Kea will search for "agrovoc.rdf"
in the directory VOCABULARIES.

KEAModelBuilder has a few other options that you can view if you run
KEAModelBuilder without any arguments. Here is a list of all the
options:

-l  Specifies name of directory.
-m  Specifies name of model.
-e  Specifies encoding.
-d  Turns debugging mode on.
-k  Use keyphrase frequency statistic.
-p  Disallow internal periods.
-x  Sets the maximum phrase length (default: 3).
-y  Sets the minimum phrase length (default: 1).
-o  The minimum number of times a phrase needs to occur (default: 2).
-s  Sets the list of stopwords to use (default: StopwordsEnglish).
-t  Set the stemmer to use (default: IteratedLovinsStemmer).

The -e option allows you to specify a different character encoding
supported by Java. For example, to extract keyphrases from Chinese
documents encoded using GBK, you would specify "-e GBK" as an
argument.

The -d option generates some output that shows the progress of the
model builder.

If -k is set, the keyphrase frequency attribute is used in the model.
For more info on this, have a look at the paper on "Domain-specific
keyphrase extraction" listed below. Using this option improves
accuracy if the domain of the documents for which you want to extract
keyphrases is the same as the domain of the training documents.
In other words, if you want to extract keyphrases from papers on
radiology, and your training documents are about radiology, you
should use this option.

If -p is set, KEA does not consider phrases with internal periods as
candidate keyphrases. It is important to use this if a full stop is
not always followed by white space in the documents.

Using -s and -t you can set different classes for stopword detection
and stemming respectively (for languages other than English).

Extracting keyphrases
=====================

To extract keyphrases for some documents, put them into an empty
directory. Then rename them so that they end with the suffix ".txt".
If you've previously built a keyphrase extraction model you can now
extract keyphrases for these documents using:

java KEAKeyphraseExtractor -l <name_of_directory> -m <name_of_model> -v <name_of_vocabulary> -f <vocabulary_format>

(See section "Building a keyphrase extraction model" above for
information on the controlled vocabulary and its format.)

This will create a ".key" file for each document in the directory.
Each file will contain five extracted keyphrases for the
corresponding document.

If a ".key" file is already present it won't be overwritten. Instead,
the keyphrases present in that file will be used to evaluate the
extraction model. The stemmed extracted phrases are compared to the
stemmed versions of the phrases in the ".key" file.
KEAKeyphraseExtractor reports the number of hits among the total
number of extracted phrases for those documents that have associated
".key" files in the directory.

KEAKeyphraseExtractor has a few options. Here they are:

-l  Specifies name of directory.
-m  Specifies name of model.
-e  Specifies encoding.
-n  Specifies number of phrases to be output (default: 5).
-d  Turns debugging mode on.
-a  Also write stemmed phrase and score into ".key" file.

Important comment
-----------------

To get good results, it is important that the input text for KEA is
as "clean" as possible. That means HTML tags etc.
in the input documents need to be deleted before the model is built
and before keyphrases are extracted from new documents.

Also, make sure that you have enough documents in both the training
and the extraction phase. For example, training requires at least
20-30 manually indexed documents.

It is important that the manually assigned keyphrases in the ".key"
files correspond to the entries in the controlled vocabulary that
you use.

----------------------------------------------------------------------

3. History and Further Information:
-----------------------------------

Kea has been developed at the Digital Library Lab at the University
of Waikato in New Zealand. All information about the Kea algorithm is
available from the New Zealand Digital Library web site at
http://www.nzdl.org/Kea/.

Kea-4.1 is designed for indexing with keyphrases from a controlled
vocabulary. It builds on the free keyphrase extraction algorithm
(Kea-3.0 and earlier versions), which was developed by Eibe Frank,
Gordon Paynter, Craig Nevill-Manning and Carl Gutwin.

Olena Medelyan (http://www.cs.waikato.ac.nz/~olena) re-implemented
Kea-3.0 into Kea-4.0 for controlled keyphrase extraction as a part of
her Master's thesis, for indexing of agricultural documents with
terms from the agricultural thesaurus Agrovoc
(http://www.fao.org/agrovoc/).

Kea-4.1 is the extended version, which allows the usage of any
controlled vocabulary in the specified text format (see section 2,
step d), or the SKOS format (http://www.w3.org/2004/02/skos/), a W3C
RDF specification for encoding ontologies, thesauri, dictionaries and
term lists.

-----------------------------------------------------------------------

4. Copyright:
-------------

KEA is distributed under the GNU General Public License.

 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Please read the file COPYING in the Kea home directory.

-----------------------------------------------------------------------
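
Sketch: reading the plain text vocabulary format (section 2, step d)

The following is a minimal sketch, not part of the KEA distribution,
of how the three plain text vocabulary files described in section 2
could be read. It only assumes what this README states: .en holds
"TermID Term" separated by a blank, .use holds "NonDescriptorID
DescriptorID" separated by a tab, and .rel holds a TermID, a tab, and
blank-separated related IDs. The class name PlainVocabularySketch,
its method names and the sample entries are hypothetical, and KEA's
own vocabulary loader may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a class shipped with KEA.
class PlainVocabularySketch {

    // .en lines: "TermID Term". Split at the first blank only, since a
    // term may itself contain blanks.
    static Map<String, String> parseEn(List<String> lines) {
        Map<String, String> terms = new LinkedHashMap<String, String>();
        for (String line : lines) {
            int blank = line.indexOf(' ');
            if (blank <= 0) {
                continue; // skip empty or malformed lines
            }
            terms.put(line.substring(0, blank), line.substring(blank + 1).trim());
        }
        return terms;
    }

    // .use lines: "NonDescriptorID<TAB>DescriptorID".
    static Map<String, String> parseUse(List<String> lines) {
        Map<String, String> use = new LinkedHashMap<String, String>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip empty or malformed lines
            }
            use.put(parts[0].trim(), parts[1].trim());
        }
        return use;
    }

    // .rel lines: "TermID<TAB>RelatedID RelatedID ...".
    static Map<String, List<String>> parseRel(List<String> lines) {
        Map<String, List<String>> rel = new LinkedHashMap<String, List<String>>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2 || parts[1].trim().isEmpty()) {
                continue; // skip lines without any related IDs
            }
            String[] ids = parts[1].trim().split(" +");
            rel.put(parts[0].trim(), new ArrayList<String>(Arrays.asList(ids)));
        }
        return rel;
    }

    public static void main(String[] args) {
        // Made-up sample entries; real files would be read line by line
        // from the .en, .use and .rel files of the vocabulary.
        Map<String, String> en = parseEn(Arrays.asList("101 soil erosion", "102 erosion"));
        Map<String, String> use = parseUse(Arrays.asList("205\t101"));
        Map<String, List<String>> rel = parseRel(Arrays.asList("101\t102"));

        // Resolve a non-descriptor ID to the term of its descriptor.
        System.out.println("205 -> " + en.get(use.get("205")));
        System.out.println("related to 101: " + rel.get("101"));
    }
}
```

Splitting .en lines at the first blank only is a design choice of
this sketch, made because multi-word terms contain blanks themselves.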