Dec 12, 2008

A Review on Attribute or Summarization Extraction

  1. Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema and Andrew Fano. Text Mining to Extract Product Attributes. SIGKDD Explorations (2006).
  2. Ana-Maria Popescu and Oren Etzioni. Extracting Product Features and Opinions from Reviews. EMNLP 2005.
  3. Sujith Ravi, Marius Pasca. Using Structured Text for Large-Scale Attribute Extraction, CIKM 2008.
  4. Patrick Nguyen, Milind Mahajan and Geoffrey Zweig. Summarization of Multiple User Reviews in the Restaurant Domain. MSR Technical Report. 2007.
    A brief excerpt from this paper describing their system:
    "..., we build a classifier which predicts, from each review, a numerical rating associated to the review. Then, each review is segmented into ... snippets. .... Each snippet is then assigned a bipolar relevance score, to mark it as characteristically disparaging (bad) or praising (good). Snippets are categorized into predefined types (such as food, or service). To produce the final review, we tally the most relevant snippets, both good and bad, in each category."
    For the following reasons, I do NOT believe this method works, nor would I cite this tech report: (1) The authors do NOT explain how they "tally the most relevant snippets". (2) I do not believe a classifier can predict a rating from a review well. (3) The usage of pLSA is wrong.

  5. Rada Mihalcea and Hakan Ceylan. Explorations in Automatic Book Summarization. EMNLP 2007.
    This paper segments a book into segments and ranks them using TextRank, which is PageRank on textual vertices (ref [6]).

  6. Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.
    This paper addresses the keyword extraction problem. The solution is simple: construct a graph in which each vertex is a word from the corpus and each edge indicates a co-occurrence relation between two words, then run PageRank on this graph to rank the words. As this method does not employ latent topics, I guess that words like "这些" ("these") and "非常" ("very") will rank high in the list. It may be useful to compare this method with ranking by P(w|c), where w denotes a word and c denotes the domain.
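
    The graph construction and ranking described for [6] can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window size, the damping factor, the iteration count and the function names `cooccurrence_graph` and `pagerank` are all assumptions made for the example.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph: each distinct word is a vertex, and an edge
    links two different words that appear within `window` tokens of each other.
    (The window size is an assumption of this sketch.)"""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on {vertex: set_of_neighbours}.
    Every vertex in the graph has at least one neighbour by construction."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new_rank = {}
        for v in graph:
            # Each neighbour u shares its rank equally among its own neighbours.
            incoming = sum(rank[u] / len(graph[u]) for u in graph[v])
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


tokens = "the cat the dog the bird the fish".split()
ranks = pagerank(cooccurrence_graph(tokens))
top_word = max(ranks, key=ranks.get)
```

    In this toy input the function word "the" co-occurs with every other word and ends up as `top_word`, which is the kind of behaviour the guess about frequent non-topical words anticipates.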

  7. Rada Mihalcea. Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization. ACL 2004.
    This paper is very similar to [6]. The differences are: (1) it ranks sentences instead of words, (2) the similarity between two sentences is measured by the Jaccard coefficient, and (3) it uses HITS in addition to PageRank. It does not explain how the authority and hub values are each used.
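
    Sentence ranking with a Jaccard-weighted graph, as summarized for [7], might look like the sketch below. It assumes whitespace tokenization and a weighted PageRank update with a damping factor of 0.85; `jaccard` and `rank_sentences` are hypothetical names, and the HITS variant is not covered.

```python
def jaccard(a, b):
    """Jaccard coefficient between the word sets of two sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Return sentence indices ordered from highest to lowest score, using a
    weighted PageRank over the Jaccard similarity graph."""
    n = len(sentences)
    # Symmetric weight matrix with no self-loops.
    weights = [
        [jaccard(s, t) if i != j else 0.0 for j, t in enumerate(sentences)]
        for i, s in enumerate(sentences)
    ]
    out_weight = [sum(row) for row in weights]
    score = [1.0 / n] * n
    for _ in range(iterations):
        score = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_weight[j] * score[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return sorted(range(n), key=lambda i: score[i], reverse=True)


reviews = ["the food was great", "the food was good", "parking is hard"]
order = rank_sentences(reviews)
```

    A sentence that shares no words with the others, such as the third one here, receives only the baseline score and is ranked last.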

  8. Güneş Erkan and Dragomir R. Radev. LexPageRank: Prestige in Multi-Document Text Summarization. EMNLP 2004.
    This is similar to [4], [5] and [6]. It ranks sentences from multiple documents using PageRank. Similarity between sentences is computed by cosine distance, but I do not see any explanation of how the cosine distance between two sentences is defined.
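
    One common reading of a cosine measure between two sentences treats each sentence as a bag-of-words vector of term counts. The sketch below assumes exactly that, with whitespace tokenization; the paper may weight terms differently (for example with idf), so this is an illustration rather than their definition. The name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between the term-count vectors of two sentences."""
    vec_a, vec_b = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both sentences contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


similarity = cosine_similarity("the service was slow", "the service was friendly")
distance = 1.0 - similarity  # the corresponding cosine distance
```

    The resulting similarity values could then serve as edge weights in a sentence graph ranked by PageRank.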

  9. Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. WWW 2008.
Attachments:
  1. A Python implementation of PageRank, which may be used in our experiments.
