java.lang.Object
weka.filters.Filter
KEAFilter
public class KEAFilter
This filter converts incoming data into a form suitable for keyphrase classification. It assumes that the dataset contains two string attributes: the first holds the text of a document, and the second holds the keyphrases associated with that document (if present). The filter converts every instance (i.e. document) into a set of instances, one for each word-based n-gram in the document. The string attribute representing the document is replaced by some numeric features, the estimated probability of each n-gram being a keyphrase, and the rank of the phrase in the document according to that probability. Each new instance also has a class value associated with it: "true" if the n-gram is a true keyphrase, and "false" otherwise. If the input document does not come with author-assigned keyphrases, the class values for that document will be missing.
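The expansion of a document into word-based n-grams can be pictured with a small standalone sketch. It is not KEAFilter's own code: the class name `NGramSketch`, the method `ngrams`, and the plain whitespace tokenization are assumptions made for illustration, and the real filter additionally applies stemming, stopword handling, and feature computation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NGramSketch and ngrams are hypothetical names,
// not part of KEAFilter. Tokenization is plain whitespace splitting.
class NGramSketch {

    // Returns every word-based n-gram whose length lies in [minLen, maxLen].
    static List<String> ngrams(String text, int minLen, int maxLen) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            // Grow the phrase one word at a time from the start position.
            for (int len = 1; len <= maxLen && start + len <= words.length; len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words[start + len - 1]);
                if (len >= minLen) {
                    result.add(phrase.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One candidate phrase per line, for phrase lengths 1 to 3.
        for (String phrase : ngrams("keyphrase extraction with weka", 1, 3)) {
            System.out.println(phrase);
        }
    }
}
```

In this sketch the length bounds play the role of the MinPhraseLength and MaxPhraseLength settings described for the filter.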
Field Summary | |
---|---|
static Vocabulary | m_Vocabulary: The Vocabulary object. |
Constructor Summary | |
---|---|
KEAFilter() | |
Method Summary | |
---|---|
boolean | batchFinished(): Signify that this batch of input to the filter is finished. |
boolean | getCheckForProperNouns(): Get the M_CheckProperNouns value. |
boolean | getDebug(): Get the value of Debug. |
boolean | getDisallowInternalPeriods(): Get whether internal periods are disallowed. |
int | getDocumentAtt(): Get the value of DocumentAtt. |
int | getKeyphrasesAtt(): Get the value of KeyphrasesAtt. |
boolean | getKFused(): Gets whether keyphrase frequency attribute is used. |
int | getMaxPhraseLength(): Get the value of MaxPhraseLength. |
int | getMinNumOccur(): Get the value of MinNumOccur. |
int | getMinPhraseLength(): Get the value of MinPhraseLength. |
int | getNumPhrases(): Get the value of numPhrases. |
java.lang.String[] | getOptions(): Gets the current settings of the filter. |
int | getProbabilityIndex(): Returns the index of the phrases' probabilities in the output ARFF file. |
int | getRankIndex(): Returns the index of the phrases' ranks in the output ARFF file. |
int | getStemmedPhraseIndex(): Returns the index of the stemmed phrases in the output ARFF file. |
Stemmer | getStemmer(): Get the Stemmer value. |
Stopwords | getStopwords(): Get the M_Stopwords value. |
int | getUnstemmedPhraseIndex(): Returns the index of the unstemmed phrases in the output ARFF file. |
java.lang.String | getVocabulary(): Get the M_Vocabulary value. |
java.lang.String | getVocabularyFormat(): Get the M_VocabularyFormat value. |
java.lang.String | globalInfo(): Returns a string describing this filter. |
boolean | input(Instance instance): Input an instance for filtering. |
java.util.Enumeration | listOptions(): Returns an enumeration describing the available options. |
void | loadThesaurus(Stemmer st, Stopwords sw) |
static void | main(java.lang.String[] argv): Main method for testing this class. |
void | setCheckForProperNouns(boolean newM_CheckProperNouns): Set the M_CheckProperNouns value. |
void | setDebug(boolean newDebug): Set the value of Debug. |
void | setDisallowInternalPeriods(boolean disallow): Set whether internal periods are disallowed. |
void | setDocumentAtt(int newDocumentAtt): Set the value of DocumentAtt. |
boolean | setInputFormat(Instances instanceInfo): Sets the format of the input instances. |
void | setKeyphrasesAtt(int newKeyphrasesAtt): Set the value of KeyphrasesAtt. |
void | setKFused(boolean flag): Sets whether keyphrase frequency attribute is used. |
void | setMaxPhraseLength(int newMaxPhraseLength): Set the value of MaxPhraseLength. |
void | setMinNumOccur(int newMinNumOccur): Set the value of MinNumOccur. |
void | setMinPhraseLength(int newMinPhraseLength): Set the value of MinPhraseLength. |
void | setNumFeature(): Sets whether Vocabulary relation attribute is used. |
void | setNumPhrases(int newnumPhrases): Set the value of numPhrases. |
void | setOptions(java.lang.String[] options): Parses a given list of options controlling the behaviour of this object. |
void | setStemmer(Stemmer newStemmer): Set the Stemmer value. |
void | setStopwords(Stopwords newM_Stopwords): Set the M_Stopwords value. |
void | setVocabulary(java.lang.String newM_Vocabulary): Set the M_Vocabulary value. |
void | setVocabularyFormat(java.lang.String newM_VocabularyFormat): Set the M_VocabularyFormat value. |
static java.lang.String[] | sort(java.lang.String[] a): Sorts an array of Strings into alphabetic order. |
static void | swap(int loc1, int loc2, java.lang.String[] a) |
Methods inherited from class weka.filters.Filter |
---|
batchFilterFile, filterFile, getOutputFormat, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputPeek, useFilter |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static Vocabulary m_Vocabulary
Constructor Detail |
---|
public KEAFilter()
Method Detail |
---|
public java.lang.String getVocabulary()
public void setVocabulary(java.lang.String newM_Vocabulary)
Parameters:
newM_Vocabulary - The new M_Vocabulary value.
public java.lang.String getVocabularyFormat()
public void setVocabularyFormat(java.lang.String newM_VocabularyFormat)
Parameters:
newM_VocabularyFormat - The new M_VocabularyFormat value.
public boolean getCheckForProperNouns()
public void setCheckForProperNouns(boolean newM_CheckProperNouns)
Parameters:
newM_CheckProperNouns - The new M_CheckProperNouns value.
public Stopwords getStopwords()
public void setStopwords(Stopwords newM_Stopwords)
Parameters:
newM_Stopwords - The new M_Stopwords value.
public Stemmer getStemmer()
public void setStemmer(Stemmer newStemmer)
Parameters:
newStemmer - The new Stemmer value.
public int getMinNumOccur()
public void setMinNumOccur(int newMinNumOccur)
Parameters:
newMinNumOccur - Value to assign to MinNumOccur.
public int getMaxPhraseLength()
public void setMaxPhraseLength(int newMaxPhraseLength)
Parameters:
newMaxPhraseLength - Value to assign to MaxPhraseLength.
public int getMinPhraseLength()
public void setMinPhraseLength(int newMinPhraseLength)
Parameters:
newMinPhraseLength - Value to assign to MinPhraseLength.
public int getNumPhrases()
public void setNumPhrases(int newnumPhrases)
Parameters:
newnumPhrases - Value to assign to numPhrases.
public int getStemmedPhraseIndex()
public int getUnstemmedPhraseIndex()
public int getProbabilityIndex()
public int getRankIndex()
public int getDocumentAtt()
public void setDocumentAtt(int newDocumentAtt)
Parameters:
newDocumentAtt - Value to assign to DocumentAtt.
public int getKeyphrasesAtt()
public void setKeyphrasesAtt(int newKeyphrasesAtt)
Parameters:
newKeyphrasesAtt - Value to assign to KeyphrasesAtt.
public boolean getDebug()
public void setDebug(boolean newDebug)
Parameters:
newDebug - Value to assign to Debug.
public void setKFused(boolean flag)
public void setNumFeature()
public boolean getKFused()
public boolean getDisallowInternalPeriods()
public void setDisallowInternalPeriods(boolean disallow)
Parameters:
disallow - whether internal periods are disallowed
public void loadThesaurus(Stemmer st, Stopwords sw)
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-K
Specifies whether keyphrase frequency statistic is used.
-R
Specifies whether Vocabulary relation statistic is used.
-M length
Sets the maximum phrase length (default: 3).
-L length
Sets the minimum phrase length (default: 1).
-D
Turns debugging mode on.
-I index
Sets the index of the attribute containing the documents (default: 0).
-J index
Sets the index of the attribute containing the keyphrases (default: 1).
-P
Disallow internal periods
-O number
The minimum number of times a phrase needs to occur (default: 2).
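As a rough illustration of how the flags listed above map onto settings, the following standalone sketch parses an option array by hand. It is not KEAFilter's parser: `KeaOptionsSketch` and its field names are hypothetical, the boolean flags are assumed to be off when absent, and an unchecked `IllegalArgumentException` stands in for the `java.lang.Exception` that `setOptions` declares. Only the flag letters and the numeric defaults are taken from the list above.

```java
// Illustrative sketch only: KeaOptionsSketch is a hypothetical stand-in that
// mirrors the documented flags; it is not KEAFilter's own option parser.
class KeaOptionsSketch {
    boolean kfUsed = false;                  // -K (assumed off when absent)
    boolean vocabularyRelation = false;      // -R (assumed off when absent)
    int maxPhraseLength = 3;                 // -M, documented default 3
    int minPhraseLength = 1;                 // -L, documented default 1
    boolean debug = false;                   // -D (assumed off when absent)
    int documentAtt = 0;                     // -I, documented default 0
    int keyphrasesAtt = 1;                   // -J, documented default 1
    boolean disallowInternalPeriods = false; // -P (assumed off when absent)
    int minNumOccur = 2;                     // -O, documented default 2

    // Parses an option array such as {"-M", "4", "-D"}.
    static KeaOptionsSketch parse(String[] options) {
        KeaOptionsSketch o = new KeaOptionsSketch();
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-K": o.kfUsed = true; break;
                case "-R": o.vocabularyRelation = true; break;
                case "-D": o.debug = true; break;
                case "-P": o.disallowInternalPeriods = true; break;
                case "-M": o.maxPhraseLength = intArg(options, ++i); break;
                case "-L": o.minPhraseLength = intArg(options, ++i); break;
                case "-I": o.documentAtt = intArg(options, ++i); break;
                case "-J": o.keyphrasesAtt = intArg(options, ++i); break;
                case "-O": o.minNumOccur = intArg(options, ++i); break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
        return o;
    }

    // Reads the integer value that follows a flag at position i - 1.
    private static int intArg(String[] options, int i) {
        if (i >= options.length) {
            throw new IllegalArgumentException("Missing value after " + options[i - 1]);
        }
        return Integer.parseInt(options[i]);
    }

    public static void main(String[] args) {
        KeaOptionsSketch o = parse(new String[] {"-M", "4", "-D", "-O", "3"});
        System.out.println(o.maxPhraseLength + " " + o.debug + " " + o.minNumOccur);
    }
}
```

An array of the same shape, for example `{"-M", "4", "-D"}`, is what `setOptions(java.lang.String[] options)` is documented to receive.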
Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by:
getOptions in interface OptionHandler
public java.util.Enumeration listOptions()
Specified by:
listOptions in interface OptionHandler
public java.lang.String globalInfo()
public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
Overrides:
setInputFormat in class Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Throws:
java.lang.Exception - if the inputFormat can't be set successfully
public boolean input(Instance instance) throws java.lang.Exception
Overrides:
input in class Filter
Parameters:
instance - the input instance
Throws:
java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.
public boolean batchFinished() throws java.lang.Exception
Overrides:
batchFinished in class Filter
Throws:
java.lang.Exception - if no input structure has been defined
public static void swap(int loc1, int loc2, java.lang.String[] a)
public static java.lang.String[] sort(java.lang.String[] a)
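The page gives only the signatures of `swap` and `sort`, so the sketch below is one plausible reading rather than the class's actual code. It assumes that `swap` exchanges two array positions and that `sort` orders the array in place (here with a simple selection sort) and returns the same array. `SortSketch` is a hypothetical name.

```java
// Illustrative sketch only: the signatures match the documented swap and sort,
// but the bodies are an assumption (an in-place selection sort).
class SortSketch {

    // Exchanges the elements at positions loc1 and loc2 of a.
    static void swap(int loc1, int loc2, String[] a) {
        String tmp = a[loc1];
        a[loc1] = a[loc2];
        a[loc2] = tmp;
    }

    // Sorts a into ascending String.compareTo order and returns it.
    static String[] sort(String[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            int min = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j].compareTo(a[min]) < 0) {
                    min = j;
                }
            }
            if (min != i) {
                swap(i, min, a);
            }
        }
        return a;
    }

    public static void main(String[] args) {
        String[] phrases = {"weka", "filter", "keyphrase"};
        System.out.println(String.join(", ", sort(phrases)));
    }
}
```

Note that `String.compareTo` orders by character code, so upper-case letters sort before lower-case ones. Whether the real method treats case differently is not stated on this page.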
public static void main(java.lang.String[] argv)
Parameters:
argv - should contain arguments to the filter: use -h for help