Recently, Kaihua, Zhiyuan, and I had a discussion in which Kaihua introduced an interesting problem: how to identify words that are specific to a certain domain. To make this problem well defined, let us suppose that we have both domain-specific data, e.g., a set of reviews of IT products, and background data, e.g., a (much) larger set of reviews of various products. The question is how to find the words that are used specifically in the domain of "IT products".
I think this problem is worth solving. For example, given "IT product"-specific words like "screen size", "electric consumption", "weight", and "response time", we can easily locate valuable phrases and sentences in reviews of IT products. This may lead to a new kind of Web search: given a query about an IT product, return a summarized review that aggregates many reviews.
Let us denote the domain-specific documents by D' and the background documents by D, and denote the vocabularies of D and D' by V and V', where D' \subseteq D and V' \subseteq V.
Zhiyuan has already provided some preliminary results, which look promising. These results are for the film domain in the English Wikipedia corpus. Zhiyuan's method is based on the following signals:
- N(term|domain)
- N(term|corpus)
- JS{ p(term|domain) || p(term|corpus) }
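To make these three signals concrete, here is a minimal Python sketch. It is not Zhiyuan's implementation. It assumes that documents are already tokenized into lists of terms, that p(term|domain) and p(term|corpus) are plain relative frequencies, and that the JS signal for a term is that term's additive contribution to the Jensen-Shannon divergence between the two distributions. The function name `term_signals` is made up for illustration.

```python
import math
from collections import Counter


def term_signals(domain_docs, corpus_docs):
    """Return {term: (N(term|domain), N(term|corpus), JS contribution)}.

    Assumption: each document is a list of tokens, and probabilities are
    unsmoothed relative frequencies.
    """
    n_dom = Counter(t for doc in domain_docs for t in doc)
    n_cor = Counter(t for doc in corpus_docs for t in doc)
    total_dom = sum(n_dom.values())
    total_cor = sum(n_cor.values())

    signals = {}
    for term in n_dom.keys() | n_cor.keys():
        p = n_dom[term] / total_dom   # p(term|domain)
        q = n_cor[term] / total_cor   # p(term|corpus)
        m = 0.5 * (p + q)             # mixture used by JS divergence
        js = 0.0
        # Per-term share of JS(p || q), in bits; 0 * log(0) is taken as 0.
        if p > 0:
            js += 0.5 * p * math.log2(p / m)
        if q > 0:
            js += 0.5 * q * math.log2(q / m)
        signals[term] = (n_dom[term], n_cor[term], js)
    return signals


# Toy data: the background corpus D contains the domain documents D'.
domain = [["screen", "size", "good"], ["screen", "weight"]]
corpus = domain + [["good", "food"], ["good", "service", "food"]]
signals = term_signals(domain, corpus)
```

Note that the JS contribution is symmetric: a term that is common in the corpus but absent from the domain also scores high, so under these assumptions it would need to be combined with N(term|domain), or with a check that p(term|domain) > p(term|corpus), before it is read as "domain-specific".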
Now we need to do the following things:
- (zkh) Figure out how to use the above signals, and whether other signals are needed.
- (wyi) Bayesian modeling of domain-specific word identification, in particular considering the hierarchy of domains: background -> domain -> sub-domain -> entities -> reviews.
- (lzy) More detailed experiments using Tianya.cn as the background corpus, Dianping.com as the domain-specific data, and restaurants in the Dianping.com data as entities.
- The Tianya.cn LDA model is at: /gfs/wf/home/hjbai/tianya/model2/word_distribution-00000-of-00001
- The Dianping.com training documents are at: /home/chengxu/shenghuo_data/xmlfeed/reviews/dianping.com.tsv
- Editor reviews from Dianping.com (which can act as ground truth) are at: /home/chengxu/shenghuo_data/xmlfeed/dianping_data.tsv