Dec 4, 2008

Identify Domain-Specific Words

We had a solution to the question posed in this post.

Recently, Kaihua, Zhiyuan and I had a discussion in which Kaihua introduced an interesting problem: how to identify words that are specific to a certain domain. To make this problem well defined, let us suppose that we have both domain-specific data, e.g., a set of reviews of IT products, as well as background data, e.g., a (much) larger set of reviews of various products. The question is how to find those words that are used specifically in the domain of "IT products".

I think it is valuable to solve this problem. For example, given "IT product"-specific words like "screen size", "electric consumption", "weight" and "response time", we can easily locate valuable phrases/sentences in reviews of IT products. This may lead to a new way of Web search: given a query on an IT product, return a summarized review that is an aggregation of many reviews.

Let us denote the domain-specific documents by D' and the background documents by D. Denote the vocabularies of D and D' by V and V', respectively, where D'\subseteq D and V'\subseteq V.

Now, Zhiyuan has provided some preliminary results, which look promising. These preliminary results are for the film domain in the English Wikipedia corpus. Zhiyuan's method is based on the following signals:
  1. N(term|domain)
  2. N(term|corpus)
  3. JS{ p(term|domain) || p(term|corpus) }
given N(term) and N(corpus).
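
As a rough illustration of how these three signals could be computed, here is a minimal Python sketch. It is not Zhiyuan's actual implementation. It assumes whitespace tokenization, treats N(term|domain) and N(term|corpus) as raw occurrence counts, and reads the JS signal as each term's contribution to the Jensen-Shannon divergence between p(term|domain) and p(term|corpus). The function names (term_counts, js_contributions, rank_domain_terms) and the min_count parameter are made up for this example. Because the JS contribution is symmetric, the sketch keeps only terms that are relatively more frequent in the domain than in the corpus; that filter is also an assumption.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count term occurrences over a list of documents (whitespace tokens)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def js_contributions(domain_counts, corpus_counts):
    """Per-term contribution to JS(p(term|domain) || p(term|corpus)), in bits.

    The contributions are non-negative and sum to the full JS divergence.
    """
    n_domain = sum(domain_counts.values())
    n_corpus = sum(corpus_counts.values())
    if n_domain == 0 or n_corpus == 0:
        raise ValueError("both the domain data and the corpus must be non-empty")
    scores = {}
    for term in set(domain_counts) | set(corpus_counts):
        p = domain_counts[term] / n_domain   # p(term|domain)
        q = corpus_counts[term] / n_corpus   # p(term|corpus)
        m = 0.5 * (p + q)                    # mixture distribution
        score = 0.0
        if p > 0:
            score += 0.5 * p * math.log2(p / m)
        if q > 0:
            score += 0.5 * q * math.log2(q / m)
        scores[term] = score
    return scores


def rank_domain_terms(domain_docs, corpus_docs, min_count=1):
    """Rank terms over-represented in the domain by their JS contribution."""
    dc = term_counts(domain_docs)    # N(term|domain)
    cc = term_counts(corpus_docs)    # N(term|corpus)
    scores = js_contributions(dc, cc)
    n_domain, n_corpus = sum(dc.values()), sum(cc.values())
    # Assumption: keep only terms relatively more frequent in the domain.
    candidates = [
        (term, scores[term])
        for term, n in dc.items()
        if n >= min_count and n / n_domain > cc[term] / n_corpus
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data: the domain documents D' are a subset of the corpus D.
    domain = ["the screen size is great", "the screen is bright"]
    corpus = domain + ["the food is great", "the service is slow"]
    for term, score in rank_domain_terms(domain, corpus):
        print(term, round(score, 4))
```

On real data, raising min_count would suppress rare terms, whose relative frequencies are noisy. How the two raw counts should be combined with the JS score is exactly the open question in item 1 below.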

Now we need to do the following things:
  1. (zkh) Figure out how to use the above signals, or whether other signals are needed.
  2. (wyi) Bayesian modeling of domain-specific word identification, in particular, considering the hierarchy of domains:
    background -> domain -> sub-domain -> entities -> reviews.
  3. (lzy) More detailed experiments using as background corpus, as domain specific data, restaurants in data as entities.
    1. The LDA model is at: /gfs/wf/home/hjbai/tianya/model2/word_distribution-00000-of-00001
    2. The training documents are at: /home/chengxu/shenghuo_data/xmlfeed/reviews/
    3. Edit reviews in (which can act as ground-truth) are at: /home/chengxu/shenghuo_data/xmlfeed/dianping_data.tsv
    I copied 2 and 3 to /home/wyi/dianping/ so that Zhiyuan can access them conveniently.
