Friday, September 16, 2011

Tuesday, February 23, 2010

KEA - Keyphrase Extraction Algorithm


The Keyphrase Extraction Algorithm (KEA) is used to generate keywords for auto-indexing purposes. It deals with two kinds of terms:

      1. Keyword: a single-word term.

      2. Keyphrase: a multi-word lexeme.

Both kinds of terms are widely used in large document collections. They describe the content of individual documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. Keyphrase indexing is used for this task.

- Assigning keyphrases to a document is called keyphrase indexing.

This task of indexing can be done in two ways:

      1. Free Indexing

      2. Indexing with Controlled Vocabularies

Which one should be used depends on the requirement.

Steps to integrate KEA:


First, define a directory that contains all the documents to be processed. The documents must have the extension ".txt". KEA also uses a list of stopwords, with English as the default language. The language can be changed according to the requirement.


With Free Indexing there is no need for a vocabulary, but with Controlled Indexing one has to define the vocabulary. KEA matches the document's phrases against the vocabulary file.


The length of keyphrases is predefined, and keyphrases do not start or end with a stopword. In the case of Controlled Indexing, KEA collects only those phrases that match terms in the thesaurus. If the thesaurus defines relations between non-allowed terms (non-descriptors) and allowed terms (descriptors), it replaces each non-descriptor with the equivalent descriptor.
In the above diagram, pseudo-phrase matching means removing stopwords from the phrase, and then stemming and ordering the remaining words.
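
The candidate selection and pseudo-phrase matching described above can be sketched in Python as follows. This is only an illustration, not KEA's own code: the stopword list, the `crude_stem` helper and all function names are assumptions made for the example, and a real setup would use a full stopword list and a proper stemmer.

```python
import re

# Illustrative stopword list (assumption); a real list is much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def pseudo_phrase(phrase):
    """Remove stopwords, then stem and order the remaining words."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = [crude_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))

def candidates(text, max_len=3):
    """Yield word n-grams (up to max_len words) that neither start nor end with a stopword."""
    words = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i : i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)

def controlled_candidates(text, descriptors, non_descriptors, max_len=3):
    """Keep only candidates found in the vocabulary, mapping non-descriptors to descriptors."""
    index = {pseudo_phrase(d): d for d in descriptors}
    for non_descriptor, descriptor in non_descriptors.items():
        index[pseudo_phrase(non_descriptor)] = descriptor
    found = set()
    for cand in candidates(text, max_len):
        key = pseudo_phrase(cand)
        if key in index:
            found.add(index[key])
    return found
```

For Free Indexing, `candidates` alone would be enough. For Controlled Indexing, `controlled_candidates` keeps only the phrases whose pseudo-phrase form matches a vocabulary term.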


To select the required keyphrases, KEA computes the following for each candidate phrase:

  1. TF x IDF: a measure describing how specific a term is to the document under consideration, compared to all other documents. Extracted phrases with a high TF x IDF value are more likely to be keyphrases.

  2. First occurrence: the position at which the phrase first appears in the document. Terms that tend to appear at the start or at the end of a document are more likely to be keyphrases.

  3. Length of the phrase: this is also an effective factor in identifying keyphrases. A phrase may consist of one word, two words or more. The length can be controlled at development time with the KEA API.

  4. Node degree: the number of phrases in the candidate set that are semantically related to the candidate phrase. This is computed with the help of the thesaurus. Phrases with a high degree are more likely to be keyphrases.
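
The four measures above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not KEA's implementation: the function names are made up for the example, the TF x IDF formula is one common formulation (relative frequency times log2 of N over document frequency), and `related` is a hypothetical dictionary standing in for the thesaurus relations.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def count_phrase(words, phrase_words):
    """Count how often the word sequence phrase_words occurs in words."""
    n = len(phrase_words)
    return sum(1 for i in range(len(words) - n + 1) if words[i : i + n] == phrase_words)

def tf_idf(phrase, document, corpus):
    """TF x IDF of phrase in document; corpus is a list of texts that includes document."""
    p = tokenize(phrase)
    words = tokenize(document)
    if not p or not words:
        return 0.0
    tf = count_phrase(words, p) / len(words)
    df = sum(1 for doc in corpus if count_phrase(tokenize(doc), p) > 0)
    if df == 0:
        return 0.0
    return tf * math.log2(len(corpus) / df)

def first_occurrence(phrase, document):
    """Fraction of the document that precedes the first occurrence, or None if absent."""
    p = tokenize(phrase)
    words = tokenize(document)
    if not p:
        return None
    for i in range(len(words) - len(p) + 1):
        if words[i : i + len(p)] == p:
            return i / len(words)
    return None

def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(tokenize(phrase))

def node_degree(phrase, candidate_set, related):
    """Number of other candidates that the thesaurus relates to this phrase."""
    return len(related.get(phrase, set()) & (set(candidate_set) - {phrase}))
```

A phrase that occurs often in one document but rarely in the rest of the collection gets a high `tf_idf`, and a value of `first_occurrence` close to 0 means the phrase appears near the start of the document.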


Before it can extract keyphrases from new documents, KEA first needs to create a model, which captures the extraction strategy. This means that for each document in the input directory there must be a file with the extension ".key" and the same name as the corresponding document. This file should contain manually assigned keyphrases, one per line. Given the list of candidate phrases, KEA marks those that were manually assigned as positive examples and all the rest as negative examples. By analyzing the feature values described above for positive and negative candidate phrases, a model is computed, which reflects the distribution of feature values for each phrase.
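
The preparation of training data can be sketched in Python as follows. This is a minimal illustration, not KEA's own code: the function names are hypothetical, files are assumed to be UTF-8, and candidates are compared with the manual keyphrases by simple lowercase matching.

```python
from pathlib import Path

def load_training_pairs(directory):
    """Return {name: (text, manual keyphrases)} for every .txt file with a matching .key file."""
    pairs = {}
    for txt in sorted(Path(directory).glob("*.txt")):
        key = txt.with_suffix(".key")
        if not key.exists():
            continue  # no manual keyphrases, so the document cannot be used for training
        phrases = {
            line.strip().lower()
            for line in key.read_text(encoding="utf-8").splitlines()
            if line.strip()
        }
        pairs[txt.stem] = (txt.read_text(encoding="utf-8"), phrases)
    return pairs

def label_candidates(candidates, manual_phrases):
    """Mark each candidate True (positive example) if it was manually assigned, else False."""
    return {c: (c.lower() in manual_phrases) for c in candidates}
```

The labelled candidates, together with their feature values, are what a learning scheme would then use to build the model.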


The final step is the extraction of keyphrases. Once the model has been built, keyphrases are extracted from the given documents according to the extraction strategy defined in the model. The measures described above are computed for each candidate phrase, and the final list of keyphrases is retrieved according to them. The number of keyphrases to extract can be controlled at development time.

Thus, the KEA algorithm is mostly used for the automatic extraction of keywords for metadata or any other purpose.
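
The last part of this step, limiting the output to a fixed number of keyphrases, can be sketched in Python as follows. The function name, the `num_phrases` parameter and the example scores are assumptions for illustration. The scores are presumed to come from the model and are not computed here.

```python
def top_keyphrases(scores, num_phrases=5):
    """Return the num_phrases best-scoring candidates, highest score first.

    scores maps each candidate phrase to a score assumed to come from the model.
    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:num_phrases]]

# Hypothetical scores for four candidate phrases.
example_scores = {
    "keyphrase extraction": 0.9,
    "document": 0.2,
    "controlled vocabulary": 0.7,
    "thesaurus": 0.7,
}
best = top_keyphrases(example_scores, num_phrases=3)
```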




KEA – Example :

Controlled Indexing :

Video Introduction to Indexing: