Class KEAFilter

java.lang.Object
  extended by weka.filters.Filter
      extended by KEAFilter
All Implemented Interfaces:
java.io.Serializable, OptionHandler

public class KEAFilter
extends Filter
implements OptionHandler

This filter converts incoming data into a form suitable for keyphrase classification. It assumes that the dataset contains two string attributes: the first holds the text of a document, and the second holds the keyphrases associated with that document (if present). The filter converts every instance (i.e. document) into a set of instances, one for each word-based n-gram in the document. The string attribute representing the document is replaced by a set of numeric features, the estimated probability of each n-gram being a keyphrase, and the rank of the phrase within the document according to that probability. Each new instance also has a class value: "true" if the n-gram is a true keyphrase, and "false" otherwise. If the input document has no author-assigned keyphrases, the class values for that document are missing.
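The expansion of a document into word-based n-grams can be illustrated with a minimal sketch. The class NGramSketch and its method wordNGrams are hypothetical and not part of KEAFilter; the whitespace tokenization is an assumption, and the real filter may also apply stopword, stemming, and occurrence checks that are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of KEAFilter: enumerates word-based n-grams. */
final class NGramSketch {

    /**
     * Returns every run of minLen..maxLen consecutive words in the text.
     * Splitting on whitespace is an assumption made for this sketch.
     */
    static List<String> wordNGrams(String text, int minLen, int maxLen) {
        List<String> phrases = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return phrases;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            // Grow the phrase one word at a time, up to maxLen words.
            for (int len = 1; len <= maxLen && start + len <= words.length; len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words[start + len - 1]);
                if (len >= minLen) {
                    phrases.add(phrase.toString());
                }
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Lengths 1 to 3 mirror the documented defaults for -L and -M.
        for (String phrase : wordNGrams("automatic keyphrase extraction", 1, 3)) {
            System.out.println(phrase);
        }
    }
}
```

With minimum length 1 and maximum length 3, a three-word document yields six candidate phrases, each of which would become one output instance in the scheme described above.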

Version:
2.0
Author:
Eibe Frank, Olena Medelyan
See Also:
Serialized Form

Field Summary
static Vocabulary m_Vocabulary
          The Vocabulary object
 
Constructor Summary
KEAFilter()
           
 
Method Summary
 boolean batchFinished()
          Signify that this batch of input to the filter is finished.
 boolean getCheckForProperNouns()
          Get the M_CheckProperNouns value.
 boolean getDebug()
          Get the value of Debug.
 boolean getDisallowInternalPeriods()
          Get whether internal periods are disallowed.
 int getDocumentAtt()
          Get the value of DocumentAtt.
 int getKeyphrasesAtt()
          Get the value of KeyphrasesAtt.
 boolean getKFused()
          Gets whether the keyphrase frequency attribute is used.
 int getMaxPhraseLength()
          Get the value of MaxPhraseLength.
 int getMinNumOccur()
          Get the value of MinNumOccur.
 int getMinPhraseLength()
          Get the value of MinPhraseLength.
 int getNumPhrases()
          Get the value of numPhrases.
 java.lang.String[] getOptions()
          Gets the current settings of the filter.
 int getProbabilityIndex()
          Returns the index of the phrases' probabilities in the output ARFF file.
 int getRankIndex()
          Returns the index of the phrases' ranks in the output ARFF file.
 int getStemmedPhraseIndex()
          Returns the index of the stemmed phrases in the output ARFF file.
 Stemmer getStemmer()
          Get the Stemmer value.
 Stopwords getStopwords()
          Get the M_Stopwords value.
 int getUnstemmedPhraseIndex()
          Returns the index of the unstemmed phrases in the output ARFF file.
 java.lang.String getVocabulary()
          Get the M_Vocabulary value.
 java.lang.String getVocabularyFormat()
          Get the M_VocabularyFormat value.
 java.lang.String globalInfo()
          Returns a string describing this filter
 boolean input(Instance instance)
          Input an instance for filtering.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
 void loadThesaurus(Stemmer st, Stopwords sw)
           
static void main(java.lang.String[] argv)
          Main method for testing this class.
 void setCheckForProperNouns(boolean newM_CheckProperNouns)
          Set the M_CheckProperNouns value.
 void setDebug(boolean newDebug)
          Set the value of Debug.
 void setDisallowInternalPeriods(boolean disallow)
          Set whether internal periods are disallowed.
 void setDocumentAtt(int newDocumentAtt)
          Set the value of DocumentAtt.
 boolean setInputFormat(Instances instanceInfo)
          Sets the format of the input instances.
 void setKeyphrasesAtt(int newKeyphrasesAtt)
          Set the value of KeyphrasesAtt.
 void setKFused(boolean flag)
          Sets whether the keyphrase frequency attribute is used.
 void setMaxPhraseLength(int newMaxPhraseLength)
          Set the value of MaxPhraseLength.
 void setMinNumOccur(int newMinNumOccur)
          Set the value of MinNumOccur.
 void setMinPhraseLength(int newMinPhraseLength)
          Set the value of MinPhraseLength.
 void setNumFeature()
          Sets whether the Vocabulary relation attribute is used.
 void setNumPhrases(int newnumPhrases)
          Set the value of numPhrases.
 void setOptions(java.lang.String[] options)
          Parses a given list of options controlling the behaviour of this object.
 void setStemmer(Stemmer newStemmer)
          Set the Stemmer value.
 void setStopwords(Stopwords newM_Stopwords)
          Set the M_Stopwords value.
 void setVocabulary(java.lang.String newM_Vocabulary)
          Set the M_Vocabulary value.
 void setVocabularyFormat(java.lang.String newM_VocabularyFormat)
          Set the M_VocabularyFormat value.
static java.lang.String[] sort(java.lang.String[] a)
          Sorts an array of Strings into alphabetic order
static void swap(int loc1, int loc2, java.lang.String[] a)
           
 
Methods inherited from class weka.filters.Filter
batchFilterFile, filterFile, getOutputFormat, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputPeek, useFilter
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_Vocabulary

public static Vocabulary m_Vocabulary
The Vocabulary object

Constructor Detail

KEAFilter

public KEAFilter()
Method Detail

getVocabulary

public java.lang.String getVocabulary()
Get the M_Vocabulary value.

Returns:
the M_Vocabulary value.

setVocabulary

public void setVocabulary(java.lang.String newM_Vocabulary)
Set the M_Vocabulary value.

Parameters:
newM_Vocabulary - The new M_Vocabulary value.

getVocabularyFormat

public java.lang.String getVocabularyFormat()
Get the M_VocabularyFormat value.

Returns:
the M_VocabularyFormat value.

setVocabularyFormat

public void setVocabularyFormat(java.lang.String newM_VocabularyFormat)
Set the M_VocabularyFormat value.

Parameters:
newM_VocabularyFormat - The new M_VocabularyFormat value.

getCheckForProperNouns

public boolean getCheckForProperNouns()
Get the M_CheckProperNouns value.

Returns:
the M_CheckProperNouns value.

setCheckForProperNouns

public void setCheckForProperNouns(boolean newM_CheckProperNouns)
Set the M_CheckProperNouns value.

Parameters:
newM_CheckProperNouns - The new M_CheckProperNouns value.

getStopwords

public Stopwords getStopwords()
Get the M_Stopwords value.

Returns:
the M_Stopwords value.

setStopwords

public void setStopwords(Stopwords newM_Stopwords)
Set the M_Stopwords value.

Parameters:
newM_Stopwords - The new M_Stopwords value.

getStemmer

public Stemmer getStemmer()
Get the Stemmer value.

Returns:
the Stemmer value.

setStemmer

public void setStemmer(Stemmer newStemmer)
Set the Stemmer value.

Parameters:
newStemmer - The new Stemmer value.

getMinNumOccur

public int getMinNumOccur()
Get the value of MinNumOccur.

Returns:
Value of MinNumOccur.

setMinNumOccur

public void setMinNumOccur(int newMinNumOccur)
Set the value of MinNumOccur.

Parameters:
newMinNumOccur - Value to assign to MinNumOccur.

getMaxPhraseLength

public int getMaxPhraseLength()
Get the value of MaxPhraseLength.

Returns:
Value of MaxPhraseLength.

setMaxPhraseLength

public void setMaxPhraseLength(int newMaxPhraseLength)
Set the value of MaxPhraseLength.

Parameters:
newMaxPhraseLength - Value to assign to MaxPhraseLength.

getMinPhraseLength

public int getMinPhraseLength()
Get the value of MinPhraseLength.

Returns:
Value of MinPhraseLength.

setMinPhraseLength

public void setMinPhraseLength(int newMinPhraseLength)
Set the value of MinPhraseLength.

Parameters:
newMinPhraseLength - Value to assign to MinPhraseLength.

getNumPhrases

public int getNumPhrases()
Get the value of numPhrases.

Returns:
Value of numPhrases.

setNumPhrases

public void setNumPhrases(int newnumPhrases)
Set the value of numPhrases.

Parameters:
newnumPhrases - Value to assign to numPhrases.

getStemmedPhraseIndex

public int getStemmedPhraseIndex()
Returns the index of the stemmed phrases in the output ARFF file.


getUnstemmedPhraseIndex

public int getUnstemmedPhraseIndex()
Returns the index of the unstemmed phrases in the output ARFF file.


getProbabilityIndex

public int getProbabilityIndex()
Returns the index of the phrases' probabilities in the output ARFF file.


getRankIndex

public int getRankIndex()
Returns the index of the phrases' ranks in the output ARFF file.
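The class description states that each phrase is ranked within its document by estimated probability. A minimal sketch of such a ranking follows. The class RankSketch and its method are hypothetical; the choices that rank 1 denotes the highest probability and that ties keep their input order are assumptions, not behaviour confirmed by this documentation.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of KEAFilter: ranks phrases by probability. */
final class RankSketch {

    /**
     * Returns a 1-based rank for each probability.
     * Assumption: the highest probability receives rank 1, and equal
     * probabilities keep their original relative order.
     */
    static int[] ranksByProbability(double[] probabilities) {
        Integer[] order = new Integer[probabilities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on an object array is stable, so ties keep input order.
        Arrays.sort(order, (a, b) -> Double.compare(probabilities[b], probabilities[a]));

        int[] ranks = new int[probabilities.length];
        for (int position = 0; position < order.length; position++) {
            ranks[order[position]] = position + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ranksByProbability(new double[] {0.1, 0.7, 0.3})));
    }
}
```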


getDocumentAtt

public int getDocumentAtt()
Get the value of DocumentAtt.

Returns:
Value of DocumentAtt.

setDocumentAtt

public void setDocumentAtt(int newDocumentAtt)
Set the value of DocumentAtt.

Parameters:
newDocumentAtt - Value to assign to DocumentAtt.

getKeyphrasesAtt

public int getKeyphrasesAtt()
Get the value of KeyphrasesAtt.

Returns:
Value of KeyphrasesAtt.

setKeyphrasesAtt

public void setKeyphrasesAtt(int newKeyphrasesAtt)
Set the value of KeyphrasesAtt.

Parameters:
newKeyphrasesAtt - Value to assign to KeyphrasesAtt.

getDebug

public boolean getDebug()
Get the value of Debug.

Returns:
Value of Debug.

setDebug

public void setDebug(boolean newDebug)
Set the value of Debug.

Parameters:
newDebug - Value to assign to Debug.

setKFused

public void setKFused(boolean flag)
Sets whether the keyphrase frequency attribute is used.


setNumFeature

public void setNumFeature()
Sets whether the Vocabulary relation attribute is used.


getKFused

public boolean getKFused()
Gets whether the keyphrase frequency attribute is used.


getDisallowInternalPeriods

public boolean getDisallowInternalPeriods()
Get whether internal periods are disallowed.

Returns:
true if internal periods are disallowed

setDisallowInternalPeriods

public void setDisallowInternalPeriods(boolean disallow)
Set whether internal periods are disallowed.

Parameters:
disallow - true if internal periods should be disallowed

loadThesaurus

public void loadThesaurus(Stemmer st,
                          Stopwords sw)

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options controlling the behaviour of this object. Valid options are:

-K
Specifies whether the keyphrase frequency statistic is used.

-R
Specifies whether the Vocabulary relation statistic is used.

-M length
Sets the maximum phrase length (default: 3).

-L length
Sets the minimum phrase length (default: 1).

-D
Turns debugging mode on.

-I index
Sets the index of the attribute containing the documents (default: 0).

-J index
Sets the index of the attribute containing the keyphrases (default: 1).

-P
Disallows internal periods.

-O number
The minimum number of times a phrase needs to occur (default: 2).

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
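The flags listed above are passed to setOptions as an array of strings. The sketch below assembles such an array for a subset of the documented flags. The class KeaOptionsSketch and its method buildOptions are hypothetical, and the argument validation is an assumption of this sketch rather than documented behaviour of the filter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of KEA: assembles documented option flags. */
final class KeaOptionsSketch {

    /** Builds an option array using the -K, -M, -L, -O and -D flags. */
    static String[] buildOptions(boolean useKeyphraseFrequency, int maxPhraseLength,
                                 int minPhraseLength, int minNumOccur, boolean debug) {
        // Sanity check added for this sketch only (an assumption).
        if (minPhraseLength < 1 || maxPhraseLength < minPhraseLength) {
            throw new IllegalArgumentException("phrase lengths must satisfy 1 <= min <= max");
        }
        List<String> options = new ArrayList<>();
        if (useKeyphraseFrequency) {
            options.add("-K");
        }
        options.add("-M");
        options.add(Integer.toString(maxPhraseLength));
        options.add("-L");
        options.add(Integer.toString(minPhraseLength));
        options.add("-O");
        options.add(Integer.toString(minNumOccur));
        if (debug) {
            options.add("-D");
        }
        return options.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The resulting array has the shape that setOptions(String[]) accepts.
        System.out.println(String.join(" ", buildOptions(true, 3, 1, 2, false)));
    }
}
```

The values 3, 1 and 2 used in main are the documented defaults for -M, -L and -O.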

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the filter.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

globalInfo

public java.lang.String globalInfo()
Returns a string describing this filter

Returns:
a description of the filter suitable for displaying in the Explorer/Experimenter GUI

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception - if the inputFormat can't be set successfully

input

public boolean input(Instance instance)
              throws java.lang.Exception
Input an instance for filtering. Ordinarily the instance is processed and made available for output immediately. Some filters require all instances be read before producing output.

Overrides:
input in class Filter
Parameters:
instance - the input instance
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class Filter
Returns:
true if there are instances pending output
Throws:
java.lang.Exception - if no input structure has been defined

swap

public static void swap(int loc1,
                        int loc2,
                        java.lang.String[] a)

sort

public static java.lang.String[] sort(java.lang.String[] a)
Sorts an array of Strings into alphabetic order
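The documentation does not say how sort and swap are implemented. One way the two signatures could work together is a selection sort that calls swap, sketched below. The class StringSortSketch is hypothetical; sorting in place, returning the same array, and using the case-sensitive String.compareTo ordering are all assumptions.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the sort/swap pair; not the KEAFilter source. */
final class StringSortSketch {

    /** Exchanges the elements at positions loc1 and loc2 of a. */
    static void swap(int loc1, int loc2, String[] a) {
        String tmp = a[loc1];
        a[loc1] = a[loc2];
        a[loc2] = tmp;
    }

    /**
     * Selection sort using String.compareTo (case-sensitive, an assumption).
     * Sorts in place and returns the same array.
     */
    static String[] sort(String[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            int smallest = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j].compareTo(a[smallest]) < 0) {
                    smallest = j;
                }
            }
            if (smallest != i) {
                swap(i, smallest, a);
            }
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new String[] {"tree", "apple", "kea"})));
    }
}
```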


main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments to the filter: use -h for help