2. Thesaurus - If a vocabulary is provided, Kea matches the documents' phrases against this file. To process SKOS files stored in RDF format, Kea uses the Jena API. For free indexing, use the option "-v none".

3. Extracting candidates - Here Kea extracts
n-grams of a predefined length (e.g. 1 to 3 words) that do not start or
end with a stopword. In controlled indexing, it only collects those n-grams
that match thesaurus terms. If the thesaurus defines relations between
non-allowed terms (non-descriptors) and allowed terms (descriptors), it
replaces each non-descriptor by its equivalent descriptor.

4. Features - For each candidate phrase Kea computes four feature values.

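
To make the candidate extraction of step 3 concrete, here is a minimal Python sketch. It is not Kea's code (Kea itself is written in Java); the function name `extract_candidates`, the tiny stopword list, the tokenization rules, and the `vocabulary` / `descriptor_of` parameters are all assumptions made for illustration. It collects n-grams of 1 to 3 words that neither start nor end with a stopword and, when a vocabulary is supplied, maps non-descriptors to descriptors and keeps only matching terms.

```python
import re

# Tiny illustrative stopword list (an assumption; not Kea's own stopword file).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
             "for", "to", "with", "is", "are"}


def extract_candidates(text, max_len=3, vocabulary=None, descriptor_of=None):
    """Return n-grams of 1..max_len words that do not start or end with a stopword.

    vocabulary    -- optional set of allowed terms (descriptors), lowercase.
                     When given, only n-grams matching a term are kept
                     (controlled indexing); when None, all n-grams are kept
                     (free indexing).
    descriptor_of -- optional dict mapping a non-descriptor to its descriptor.
    """
    descriptor_of = descriptor_of or {}
    candidates = set()
    # Split on punctuation so that n-grams do not cross phrase boundaries.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[a-z0-9][a-z0-9'-]*", chunk)
        for start in range(len(words)):
            for n in range(1, max_len + 1):
                gram = words[start:start + n]
                if len(gram) < n:
                    break  # ran past the end of this chunk
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                phrase = " ".join(gram)
                if vocabulary is not None:
                    # Replace a non-descriptor by its equivalent descriptor.
                    phrase = descriptor_of.get(phrase, phrase)
                    if phrase not in vocabulary:
                        continue
                candidates.add(phrase)
    return candidates
```

For example, `extract_candidates("Maize grows fast.", vocabulary={"corn"}, descriptor_of={"maize": "corn"})` yields `{"corn"}`, while calling the function without a vocabulary returns every qualifying n-gram.
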
5. Building the model - Before being able to
extract keyphrases from new documents, Kea first needs to create a model
that learns the extraction strategy from manually indexed documents. This
means that, for each document in the input directory, there must be a file with
the extension ".key" and the same name as the corresponding
document. This file should contain manually assigned keyphrases, one per
line.

6. Extracting keyphrases - When extracting keyphrases from new documents, Kea takes the model built in step 5 and the feature values of each candidate phrase and computes the probability that the phrase is a keyphrase. The phrases with the highest probabilities are selected for the final set of keyphrases. The user can specify how many keyphrases to select.
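
The final selection in step 6 amounts to ranking candidates by probability and keeping the top few. The Python sketch below illustrates only that ranking step; the function name `select_keyphrases`, the `num_phrases` parameter, and the alphabetical tie-breaking are assumptions, and the probabilities are taken as given rather than computed by a model.

```python
def select_keyphrases(probabilities, num_phrases=5):
    """Return the num_phrases candidates with the highest probability.

    probabilities -- dict mapping each candidate phrase to the probability
                     (assumed to come from the trained model) that it is
                     a keyphrase.
    """
    # Sort by descending probability; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:num_phrases]]
```

With made-up scores such as `{"keyphrase extraction": 0.82, "thesaurus": 0.40, "document": 0.10, "model": 0.40}` and `num_phrases=2`, the function returns `["keyphrase extraction", "model"]`.
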