Tech Notes of Yi Wang: Multi-vocabulary LDA and Feature Weighting

Dec 7, 2008

Multi-vocabulary LDA and Feature Weighting

How do we model documents that contain more than one type of word, or, put differently, words drawn from multiple vocabularies? Vocabularies usually differ in noise ratio, sparsity, and so on, and we would like to give them different trust values.

It is notable that the full conditional posterior distribution of a latent topic contains two factors: (1) one related to the word currently in focus, and (2) one related to the global (document-level) topic distribution. So, even if the current word has ambiguous senses, the global topic distribution of the document tends to assign the word a "correct" topic. This is how LDA (but NOT pLSA) does disambiguation.
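For reference, the two factors can be written out explicitly. The form below is the usual collapsed Gibbs sampling conditional with symmetric Dirichlet priors; the note does not say which inference method it has in mind, so the sampler and all symbols here are assumptions: $n^{-i}_{k,w}$ counts how often word $w$ is assigned to topic $k$, $n^{-i}_{d,k}$ counts how many tokens of document $d$ are assigned to topic $k$ (both excluding the current token $i$), $V$ is the vocabulary size, and $\alpha$, $\beta$ are the prior parameters.

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \underbrace{\frac{n^{-i}_{k,w_i} + \beta}{\sum_{w=1}^{V} n^{-i}_{k,w} + V\beta}}_{\text{(1) word factor}}
  \;\cdot\;
  \underbrace{\left( n^{-i}_{d,k} + \alpha \right)}_{\text{(2) document factor}}
```

When the word factor is nearly flat across topics (an ambiguous word), the document factor decides which topic wins.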

If some words in a document are from a vocabulary that is more trustworthy than the others, we should scale up their occurrence counts, so that they have a larger impact on the global topic distribution.
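A minimal sketch of this count scaling, under stated assumptions: each token contributes its vocabulary's trust weight (instead of 1) to the document-level topic counts, and the topic of an ambiguous new token is then scored by the product of a word factor and the weighted document factor. The function names, the weights, and the toy document are all hypothetical and chosen only for illustration.

```python
def weighted_doc_topic_counts(topics, vocab_ids, vocab_weights, num_topics):
    """Accumulate document-level topic counts, where each token adds its
    vocabulary's trust weight rather than a plain count of 1."""
    counts = [0.0] * num_topics
    for z, v in zip(topics, vocab_ids):
        counts[z] += vocab_weights[v]
    return counts


def topic_conditional(word_factor, doc_topic_counts, alpha):
    """Normalized product of the word factor and the document factor
    (doc_topic_counts + alpha) for each topic."""
    unnorm = [wf * (c + alpha) for wf, c in zip(word_factor, doc_topic_counts)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical document: three tokens from a noisy vocabulary (id 0) sit on
# topic 0, two tokens from a trusted vocabulary (id 1) sit on topic 1.
topics = [0, 0, 0, 1, 1]
vocab_ids = [0, 0, 0, 1, 1]

# An ambiguous new token: its word factor is the same for both topics.
ambiguous_word = [0.5, 0.5]

unweighted = weighted_doc_topic_counts(topics, vocab_ids, {0: 1.0, 1: 1.0}, 2)
weighted = weighted_doc_topic_counts(topics, vocab_ids, {0: 1.0, 1: 3.0}, 2)

p_unweighted = topic_conditional(ambiguous_word, unweighted, alpha=0.1)
p_weighted = topic_conditional(ambiguous_word, weighted, alpha=0.1)
```

With equal weights the noisy vocabulary's three tokens pull the ambiguous token toward topic 0; once the trusted vocabulary is weighted 3x, its two tokens outweigh them and the token leans toward topic 1.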
