Dec 7, 2008

Multi-vocabulary LDA and Feature Weighting

How do we model documents that contain more than one type of word, that is, words drawn from multiple vocabularies? The vocabularies usually differ in noise ratio, sparsity, and so on, and we would like to give them different trust values.

It is notable that the full conditional posterior distribution of a latent topic contains two factors: (1) one related to the word currently in focus, and (2) one related to the global (document-level) topic distribution. So, even if the current word has ambiguous senses, the global topic distribution of the document tends to assign the word a "correct" topic. This is how LDA (but NOT pLSA) does disambiguation.
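
To make the two factors concrete, here is the full conditional as it is usually written for collapsed Gibbs sampling. This is a sketch that assumes symmetric Dirichlet priors; the symbols are my notation, not fixed by anything above: alpha and beta are the prior parameters, V is the vocabulary size, n^{(w_i)}_{k,-i} counts how often word w_i is assigned to topic k, n^{(\cdot)}_{k,-i} counts all words assigned to topic k, and n^{(d)}_{k,-i} counts the tokens in document d assigned to topic k, all excluding the current token i.

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \underbrace{\frac{n^{(w_i)}_{k,-i} + \beta}{n^{(\cdot)}_{k,-i} + V\beta}}_{\text{(1) word factor}}
  \;\times\;
  \underbrace{\left(n^{(d)}_{k,-i} + \alpha\right)}_{\text{(2) document factor}}
```

When the word factor is nearly flat across topics (an ambiguous word), the document factor decides the outcome.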

If some words in a document are from a vocabulary that is more trustworthy than the others, we should scale up their occurrence counts so that they have a larger impact on the global topic distribution.
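
The scaling idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a full sampler: the function names, the per-vocabulary weights (3.0 for the trusted vocabulary, 1.0 for the noisy one), the value of alpha, and the toy topic assignments are all hypothetical. Each token adds its vocabulary's weight, instead of 1, to the document-level topic counts.

```python
import numpy as np

def weighted_doc_topic_counts(topics, vocab_ids, weights, num_topics):
    """Accumulate document-level topic counts, where each token contributes
    the trust weight of its vocabulary instead of a plain count of 1."""
    counts = np.zeros(num_topics)
    for z, v in zip(topics, vocab_ids):
        counts[z] += weights[v]
    return counts

def topic_conditional(word_factor, doc_counts, alpha):
    """Normalized product of the word factor and the document factor."""
    p = np.asarray(word_factor, dtype=float) * (doc_counts + alpha)
    return p / p.sum()

# Toy document (hypothetical): 2 tokens from vocabulary 0 (trusted), both
# assigned to topic 0, and 3 tokens from vocabulary 1 (noisy), all assigned
# to topic 1. The token being resampled is assumed to be already excluded.
topics = [0, 0, 1, 1, 1]
vocab_ids = [0, 0, 1, 1, 1]
alpha = 0.1

unweighted = weighted_doc_topic_counts(topics, vocab_ids, [1.0, 1.0], 2)
weighted = weighted_doc_topic_counts(topics, vocab_ids, [3.0, 1.0], 2)

# An ambiguous word: its word factor is the same for both topics,
# so the document factor alone decides which topic is preferred.
ambiguous = [0.5, 0.5]
p_unweighted = topic_conditional(ambiguous, unweighted, alpha)
p_weighted = topic_conditional(ambiguous, weighted, alpha)

print("unweighted:", p_unweighted)
print("weighted:  ", p_weighted)
```

Without weighting, the three noisy tokens outvote the two trusted ones and the ambiguous word leans toward topic 1; with the trusted vocabulary scaled up, it leans toward topic 0 instead. How to choose the weights is a separate question that this sketch does not address.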
