Dec 8, 2008

A Latent Topic Model for Domain-Specific Attributes Extraction

It is difficult to estimate P(w|d) directly because of the high-dimensionality of d and w, and the sparsity of data.
  1. Even if we restrict ourselves to a finite set of documents, for example the training documents in pLSA, that set may contain up to a million documents. Since the vocabulary size is also large, P(w|d) has huge dimensionality, and estimating it reliably would require far more data than we have.
  2. In practice, we prefer to treat the set of documents as effectively infinite, as LDA does. Without latent topics, estimating P(w|d) would be intractable both theoretically and practically.
So, we introduce latent topics to factorize
P(w|d) = \sum_{z=1}^K P(w|z) P(z|d)
Note that the number of topics K is much smaller than both the vocabulary size V and the number of documents, which may be unbounded.
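
As a concrete illustration of this factorization, here is a minimal sketch (assuming NumPy; the sizes and Dirichlet draws are made up for illustration) that reconstructs the D x V matrix P(w|d) from a K x V topic-word matrix and a D x K document-topic matrix:

    import numpy as np

    # Hypothetical sizes: V-word vocabulary, D documents, K latent topics (K << V).
    V, D, K = 10000, 500, 50
    rng = np.random.default_rng(0)

    # P(w|z): each of the K rows is a distribution over the V words.
    p_w_given_z = rng.dirichlet(np.ones(V), size=K)      # shape (K, V)

    # P(z|d): each of the D rows is a distribution over the K topics.
    p_z_given_d = rng.dirichlet(np.ones(K), size=D)      # shape (D, K)

    # P(w|d) = sum_z P(w|z) P(z|d): D*K + K*V parameters instead of D*V.
    p_w_given_d = p_z_given_d @ p_w_given_z              # shape (D, V)
    assert np.allclose(p_w_given_d.sum(axis=1), 1.0)     # rows are valid distributions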

In practice, latent topic modeling is particularly valuable when data are sparse. For instance, consider the problem of understanding the meaning of a Web search query. A query usually contains only a few words. Without latent topics, if we represent the query (document) as a word distribution, it corresponds to only a few non-zero values in a V-dimensional distribution. That is too little information to understand the document; we cannot even say with confidence whether some of those few words are noise.

In contrast, with latent topic modeling, each word in the document can be associated with a topic distribution by Bayesian inference. Note that each topic is a set of weighted words learned from a potentially huge training corpus. This means that latent topic modeling understands documents through knowledge collected from the training data, rather than through a word distribution with only a few non-zero values.
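
As an illustration of this kind of inference, here is a hedged sketch (not the exact inference procedure of the model) that maps a short query to a topic distribution by applying Bayes' rule per word, P(z|w) proportional to P(w|z) P(z), and averaging over the query words; the names p_w_given_z and vocab, and the uniform topic prior, are assumptions:

    import numpy as np

    def query_topic_distribution(query_words, p_w_given_z, vocab):
        """Rough topic posterior for a short query.

        p_w_given_z: (K, V) matrix with one word distribution per topic.
        vocab: dict mapping word -> column index in p_w_given_z.
        Uses a uniform topic prior; a learned P(z) could be substituted.
        """
        K = p_w_given_z.shape[0]
        prior = np.full(K, 1.0 / K)
        posterior = np.zeros(K)
        for w in query_words:
            if w not in vocab:                      # skip out-of-vocabulary words
                continue
            likelihood = p_w_given_z[:, vocab[w]]   # P(w|z) for every topic z
            p_z_given_w = prior * likelihood        # Bayes: P(z|w) is proportional to P(w|z) P(z)
            posterior += p_z_given_w / p_z_given_w.sum()
        return posterior / posterior.sum() if posterior.sum() > 0 else prior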

We have recently been using the idea of latent topic modeling to identify domain-specific words. In this setting, a domain is similar to a super-document. Using latent topic factorization, we designed a hierarchical latent topic model (see the figure to the right) to estimate P(w|z), P(z|d), and P(d|c), where w denotes a word, z a topic, d a document, and c a domain (category).

Note that the model to the right can handle the case where a document is labeled as belonging to multiple domains. An extension using ideas from Pachinko allocation could further handle the case where domains are organized into a hierarchy.
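
To make the hierarchy concrete, here is a hedged sketch of one plausible generative reading of the model, sampling domain, then document, then topic, then word; the exact structure is given in the referenced figure, so the sampling order, sizes, and flat Dirichlet priors below are assumptions for illustration only:

    import numpy as np

    rng = np.random.default_rng(1)
    C, D, K, V = 10, 200, 50, 10000    # domains, documents, topics, vocabulary (made up)

    p_d_given_c = rng.dirichlet(np.ones(D), size=C)   # P(d|c), shape (C, D)
    p_z_given_d = rng.dirichlet(np.ones(K), size=D)   # P(z|d), shape (D, K)
    p_w_given_z = rng.dirichlet(np.ones(V), size=K)   # P(w|z), shape (K, V)

    def sample_word(c):
        """Generate one word index from domain c: pick a document, then a topic, then a word."""
        d = rng.choice(D, p=p_d_given_c[c])
        z = rng.choice(K, p=p_z_given_d[d])
        return rng.choice(V, p=p_w_given_z[z])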

Given the learned model, we can use Gibbs sampling to infer the word distribution of a given domain:
P(w|c) = \sum_{z=1}^K \left( P(w|z) \int P(z|d) P(d|c) \; \mathrm{d}d \right)
where the integral can be computed in closed form using the conjugacy between the Dirichlet and the multinomial. Similarly, we can infer P(c|w). These two quantities allow us to rank the words in a domain by the "characteristic" measure (proposed by David Cohn in his probabilistic HITS algorithm):
C(w|c) = P(w|c) P(c|w)
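
Putting the pieces together, this sketch computes P(w|c), P(c|w) via Bayes' rule, and the characteristic score C(w|c) = P(w|c) P(c|w) from already-estimated matrices; the domain-topic matrix p_z_given_c (standing in for the marginalized integral over d) and the uniform domain prior are assumptions:

    import numpy as np

    def characteristic_scores(p_w_given_z, p_z_given_c, p_c=None):
        """Rank words per domain by C(w|c) = P(w|c) P(c|w).

        p_w_given_z: (K, V) topic-word matrix.
        p_z_given_c: (C, K) domain-topic matrix, i.e. P(z|d) P(d|c) marginalized over d.
        p_c:         optional (C,) domain prior; uniform if omitted.
        """
        p_w_given_c = p_z_given_c @ p_w_given_z                   # (C, V): sum_z P(w|z) P(z|c)
        num_domains = p_w_given_c.shape[0]
        if p_c is None:
            p_c = np.full(num_domains, 1.0 / num_domains)
        joint = p_w_given_c * p_c[:, None]                        # P(w, c) = P(w|c) P(c)
        p_c_given_w = joint / joint.sum(axis=0, keepdims=True)    # Bayes: P(c|w)
        return p_w_given_c * p_c_given_w                          # C(w|c), shape (C, V)
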
References:
Please refer to this post for details.
