Class KEAKeyphraseExtractor

java.lang.Object
  extended by KEAKeyphraseExtractor
All Implemented Interfaces:
OptionHandler

public class KEAKeyphraseExtractor
extends java.lang.Object
implements OptionHandler

Extracts keyphrases from the documents in a given directory. Assumes that the document file names end with ".txt". Writes the extracted keyphrases to corresponding files ending with ".key", unless those files already exist. Optionally, an encoding for the documents and keyphrase files can be specified (e.g. for Chinese text). Documents for which a ".key" file already exists are used for evaluation. Valid options are:

-l "directory name"
Specifies name of directory.

-m "model name"
Specifies name of model.

-v "vocabulary name"
Specifies name of vocabulary.

-f "vocabulary format"
Specifies format of vocabulary (text or skos).

-e "encoding"
Specifies encoding.

-n
Specifies number of phrases to be output (default: 5).

-t "name of class implementing stemmer"
Sets stemmer to use (default: SremovalStemmer).

-d
Turns debugging mode on.

-a
Also write stemmed phrase and score into ".key" file.
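
The options listed above are passed as an array of strings. The following is a minimal sketch of assembling such an array; KeaOptionsSketch and its build method are hypothetical helpers, not part of KEA, and the directory, model, and vocabulary names are placeholders. It assumes that -n is followed by the phrase count as a separate argument, which the listing above does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of KEA): assembles an option array
// from the flags documented for KEAKeyphraseExtractor.
class KeaOptionsSketch {

    static String[] build(String dirName, String modelName, String vocabulary,
                          String vocabularyFormat, int numPhrases, boolean debug) {
        List<String> opts = new ArrayList<>();
        opts.add("-l"); opts.add(dirName);           // directory with the ".txt" documents
        opts.add("-m"); opts.add(modelName);         // name of the model
        opts.add("-v"); opts.add(vocabulary);        // name of the vocabulary
        opts.add("-f"); opts.add(vocabularyFormat);  // "text" or "skos"
        // Assumption: -n takes the number of phrases as its next argument.
        opts.add("-n"); opts.add(Integer.toString(numPhrases));
        if (debug) {
            opts.add("-d");                          // switch without a value
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Placeholder names; substitute real paths and names.
        String[] options = build("testdocs", "my_model", "my_vocabulary", "skos", 5, true);
        System.out.println(String.join(" ", options));
        // With KEA on the classpath, the array could then be handed over:
        //   KEAKeyphraseExtractor extractor = new KEAKeyphraseExtractor();
        //   extractor.setOptions(options);
    }
}
```

Keeping each flag and its value as separate array elements mirrors how a command line is split before it reaches main.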

Version:
1.0
Author:
Eibe Frank

Constructor Summary
KEAKeyphraseExtractor()
           
 
Method Summary
 java.util.Hashtable collectStems()
          Collects the stems of the file names.
 void extractKeyphrases(java.util.Hashtable stems)
          Extracts keyphrases from the files.
 boolean getAdditionalInfo()
          Get the value of AdditionalInfo.
 boolean getDebug()
          Get the value of debug.
 java.lang.String getDirName()
          Get the value of dirName.
 java.lang.String getEncoding()
          Get the value of encoding.
 java.lang.String getModelName()
          Get the value of modelName.
 int getNumPhrases()
          Get the value of numPhrases.
 java.lang.String[] getOptions()
          Gets the current option settings.
 Stemmer getStemmer()
          Get the Stemmer value.
 java.lang.String getVocabulary()
          Get the value of vocabulary name.
 java.lang.String getVocabularyFormat()
          Get the value of vocabulary format.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void loadModel()
          Loads the extraction model from the file.
static void main(java.lang.String[] ops)
          The main method.
 void setAdditionalInfo(boolean newAdditionalInfo)
          Set the value of AdditionalInfo.
 void setDebug(boolean newdebug)
          Set the value of debug.
 void setDirName(java.lang.String newdirName)
          Set the value of dirName.
 void setEncoding(java.lang.String newencoding)
          Set the value of encoding.
 void setModelName(java.lang.String newmodelName)
          Set the value of modelName.
 void setNumPhrases(int newnumPhrases)
          Set the value of numPhrases.
 void setOptions(java.lang.String[] options)
          Parses a given list of options controlling the behaviour of this object.
 void setStemmer(Stemmer newStemmer)
          Set the Stemmer value.
 void setVocabulary(java.lang.String newvocabulary)
          Set the value of vocabulary name.
 void setVocabularyFormat(java.lang.String newvocabularyFormat)
          Set the value of vocabulary format.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KEAKeyphraseExtractor

public KEAKeyphraseExtractor()
Method Detail

getAdditionalInfo

public boolean getAdditionalInfo()
Get the value of AdditionalInfo.

Returns:
Value of AdditionalInfo.

setAdditionalInfo

public void setAdditionalInfo(boolean newAdditionalInfo)
Set the value of AdditionalInfo.

Parameters:
newAdditionalInfo - Value to assign to AdditionalInfo.

getNumPhrases

public int getNumPhrases()
Get the value of numPhrases.

Returns:
Value of numPhrases.

getStemmer

public Stemmer getStemmer()
Get the Stemmer value.

Returns:
the Stemmer value.

setStemmer

public void setStemmer(Stemmer newStemmer)
Set the Stemmer value.

Parameters:
newStemmer - The new Stemmer value.

setNumPhrases

public void setNumPhrases(int newnumPhrases)
Set the value of numPhrases.

Parameters:
newnumPhrases - Value to assign to numPhrases.

getDebug

public boolean getDebug()
Get the value of debug.

Returns:
Value of debug.

setDebug

public void setDebug(boolean newdebug)
Set the value of debug.

Parameters:
newdebug - Value to assign to debug.

getEncoding

public java.lang.String getEncoding()
Get the value of encoding.

Returns:
Value of encoding.

setEncoding

public void setEncoding(java.lang.String newencoding)
Set the value of encoding.

Parameters:
newencoding - Value to assign to encoding.

getVocabulary

public java.lang.String getVocabulary()
Get the value of vocabulary name.

Returns:
Value of vocabulary name.

setVocabulary

public void setVocabulary(java.lang.String newvocabulary)
Set the value of vocabulary name.

Parameters:
newvocabulary - Value to assign to vocabulary name.

getVocabularyFormat

public java.lang.String getVocabularyFormat()
Get the value of vocabulary format.

Returns:
Value of vocabulary format.

setVocabularyFormat

public void setVocabularyFormat(java.lang.String newvocabularyFormat)
Set the value of vocabulary format.

Parameters:
newvocabularyFormat - Value to assign to vocabularyFormat.

getModelName

public java.lang.String getModelName()
Get the value of modelName.

Returns:
Value of modelName.

setModelName

public void setModelName(java.lang.String newmodelName)
Set the value of modelName.

Parameters:
newmodelName - Value to assign to modelName.

getDirName

public java.lang.String getDirName()
Get the value of dirName.

Returns:
Value of dirName.

setDirName

public void setDirName(java.lang.String newdirName)
Set the value of dirName.

Parameters:
newdirName - Value to assign to dirName.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options controlling the behaviour of this object. Valid options are:

-l "directory name"
Specifies name of directory.

-m "model name"
Specifies name of model.

-v "vocabulary name"
Specifies vocabulary name.

-f "vocabulary format"
Specifies vocabulary format.

-e "encoding"
Specifies encoding.

-n
Specifies number of phrases to be output (default: 5).

-d
Turns debugging mode on.

-a
Also write stemmed phrase and score into ".key" file.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
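
As an illustration of how an option array of this shape can be interpreted, here is a minimal parsing sketch. It is not KEA's implementation: OptionParseSketch is a hypothetical class, it assumes that -l, -m, -v, -f, -e, and -n each take the following element as their value, and it throws IllegalArgumentException rather than java.lang.Exception to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser (not KEA's code): maps each documented flag to its value.
class OptionParseSketch {

    // Assumption: these flags are each followed by a value.
    private static final String VALUE_FLAGS = "lmvfen";
    // These flags are plain switches.
    private static final String SWITCH_FLAGS = "da";

    static Map<String, String> parse(String[] options) {
        Map<String, String> parsed = new HashMap<>();
        for (int i = 0; i < options.length; i++) {
            String opt = options[i];
            if (opt.length() != 2 || opt.charAt(0) != '-') {
                throw new IllegalArgumentException("Unsupported option: " + opt);
            }
            char flag = opt.charAt(1);
            if (VALUE_FLAGS.indexOf(flag) >= 0) {
                if (i + 1 >= options.length) {
                    throw new IllegalArgumentException("Missing value for option: " + opt);
                }
                parsed.put(opt, options[++i]);   // consume the value as well
            } else if (SWITCH_FLAGS.indexOf(flag) >= 0) {
                parsed.put(opt, "true");
            } else {
                throw new IllegalArgumentException("Unsupported option: " + opt);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse(new String[] {"-l", "testdocs", "-n", "10", "-a"});
        System.out.println(parsed);
    }
}
```

Rejecting unknown flags up front matches the documented contract that an unsupported option raises an exception.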

getOptions

public java.lang.String[] getOptions()
Gets the current option settings.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

collectStems

public java.util.Hashtable collectStems()
                                 throws java.lang.Exception
Collects the stems of the file names.

Throws:
java.lang.Exception
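
To make the idea of a file-name stem concrete, the sketch below derives stems from ".txt" file names and records whether a matching ".key" file is present. It is an assumption about the general approach, not KEA's actual code: StemCollectorSketch, its method names, and the Boolean values stored in the table are all hypothetical.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

// Hypothetical sketch (not KEA's code): collects file-name stems of ".txt" documents.
class StemCollectorSketch {

    // Maps each stem ("doc1" for "doc1.txt") to whether "doc1.key" is also listed.
    static Hashtable<String, Boolean> stemsFromNames(String[] fileNames) {
        Set<String> all = new HashSet<>(Arrays.asList(fileNames));
        Hashtable<String, Boolean> stems = new Hashtable<>();
        for (String name : fileNames) {
            if (name.endsWith(".txt")) {
                String stem = name.substring(0, name.length() - ".txt".length());
                stems.put(stem, all.contains(stem + ".key"));
            }
        }
        return stems;
    }

    // Convenience wrapper that reads the names from a directory on disk.
    static Hashtable<String, Boolean> stemsFromDirectory(File dir) {
        String[] names = dir.list();
        if (names == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dir);
        }
        return stemsFromNames(names);
    }

    public static void main(String[] args) {
        String[] names = {"doc1.txt", "doc1.key", "doc2.txt", "notes.md"};
        System.out.println(stemsFromNames(names));
    }
}
```

In this sketch, a stem mapped to true corresponds to a document that already has a ".key" file and would therefore serve for evaluation, as described in the class overview.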

extractKeyphrases

public void extractKeyphrases(java.util.Hashtable stems)
                       throws java.lang.Exception
Extracts keyphrases from the files.

Throws:
java.lang.Exception

loadModel

public void loadModel()
               throws java.lang.Exception
Loads the extraction model from the file.

Throws:
java.lang.Exception

main

public static void main(java.lang.String[] ops)
The main method.