<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1039">
  <Title>Multi-Document Summarization of Evaluative Text</Title>
  <Section position="6" start_page="308" end_page="311" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluated our two summarizers by performing a user study in which four treatments were considered: SEA, MEAD*, human-written summaries as a topline and summaries generated by MEAD (with all options set to default) as a baseline.</Paragraph>
    <Section position="1" start_page="308" end_page="309" type="sub_section">
      <SectionTitle>
5.1 The Experiment
</SectionTitle>
      <Paragraph position="0"> Twenty-eight undergraduate students participated in our experiment, seven for each treatment. Each participant was given a set of 20 customer reviews randomly selected from a corpus of reviews. In each treatment three participants received reviews from a corpus of 46 reviews of the Canon G3 digital camera and four received them from a corpus of 101 reviews of the Apex 2600 Progressive Scan DVD player, both obtained from Hu and Liu (2004b). The reviews from these corpora which serve as input to our systems have been manually annotated with crude features, strength, and polarity. We used this 'gold standard' for crude feature, strength, and polarity extraction because we wanted our experiments to focus on our summary and not be confounded by errors in the knowledge extraction phase.</Paragraph>
      <Paragraph position="1"> The participant was told to pretend that they work for the manufacturer of the product (either Canon or Apex). They were told that they would have to provide a 100 word summary of the reviews to the quality assurance department. The purpose of these instructions was to prime the user to the task of looking for information worthy of summarization. They were then given 20 minutes to explore the set of reviews.</Paragraph>
      <Paragraph position="2"> After 20 minutes, the participant was asked to stop. The participant was then given a set of in- null structions which explained that the company was testing a computer-based system for automatically generating a summary of the reviews s/he has been reading. S/he was then shown a 100 word summary of the 20 reviews generated either by MEAD, MEAD*, SEA, or written by a human 4.</Paragraph>
      <Paragraph position="3"> Figure 2 shows four summaries of the same 20 reviews, one of each type.</Paragraph>
      <Paragraph position="4"> In order to facilitate their analysis, summaries were displayed in a web browser. The upper portion of the browser contained the text of the summary with 'footnotes' linking to reviews on which the summary wasbased. For MEADand MEAD*, for each sentence the footnote pointed to the review from which the sentence had been extracted. For SEA and human-generated summaries, for each aggregate evaluation the footnote pointed to the review containing a sample sentence on which that evaluation was based. In all summaries, clicking on one of the footnotes caused the corresponding review to be displayed in which the appropriate sentence was highlighted.</Paragraph>
      <Paragraph position="5"> Once finished, the participant was asked to fill out a questionnaire assessing the summary along several dimensions related toitseffectiveness. The participant could still access the summary while s/he worked on the questionnaire.</Paragraph>
      <Paragraph position="6"> Our questionnaire consisted of nine questions.</Paragraph>
      <Paragraph position="7"> The first five questions were the SEE linguistic well-formedness questions used at the 2005 Document Understanding Conference (DUC) (Nat, 2005a). The next three questions were designed to assess the content of the summary. We based our questions on the Responsive evaluation at DUC 2005; however, we were interested in a more specific evaluation of the content that one overall rank. As such, we split the content into the following three separate questions: a0 (Recall) The summary contains all of the information you would have included from the source text.</Paragraph>
      <Paragraph position="8"> a0 (Precision) The summary contains no information you would NOT have included from the source text.</Paragraph>
      <Paragraph position="9"> a0 (Accuracy) All information expressed in the summary accurately reflects the information contained in the source text.</Paragraph>
      <Paragraph position="10"> The final question in the questionnaire asked the participant to rank the overall quality of the summary holistically.</Paragraph>
      <Paragraph position="11"> 4For automatically generated summaries, we generated the longest possible summary with less than 100 words.</Paragraph>
    </Section>
    <Section position="2" start_page="309" end_page="310" type="sub_section">
      <SectionTitle>
5.2 Quantitative Results
</SectionTitle>
      <Paragraph position="0"> Table 1 consists of two parts. The first top half focuses on linguistic questions while the second bottom half focuses on content issues. We performed a two-way ANOVA test with summary type as rows and the question sets as columns. Overall, it is easy to conclude that MEAD* and SEA performed at a roughly equal level, while the baseline MEAD performed significantly lower and the Human summarizer significantly higher (p a1 a3 001).</Paragraph>
      <Paragraph position="1"> When individual questions/categories are considered, there are few questions that differentiate between MEAD* and SEA with a p-value below 0.05. The primary reason is our small sample size.</Paragraph>
      <Paragraph position="2"> Nonetheless, if we relax the p-value threshold, we can make the following observations/hypotheses.</Paragraph>
      <Paragraph position="3"> To validate some of these hypotheses, we would conduct a larger user study in future work.</Paragraph>
      <Paragraph position="4"> On the linguistic side, the average score suggests the ordering of: Human a2</Paragraph>
      <Paragraph position="6"> SEA are also on par with the median DUC score (Nat, 2005b). On the focus question, in fact, SEA's score is tied with the Human's score, which may be a beneficial effect of the UDF guiding content structuring in a top-down fashion. It is also interesting to see that SEA outperforms MEAD* on grammaticality, showing that the generative text approach may be more effective than simply extracting sentences on this aspect of grammaticality. On the other hand, MEAD* out-performs SEA on non-redundancy, and structure and coherence. SEA's disappointing performance on structure and coherence was among the most surprising finding. One possibility is that our adaptation of GEA content structuring strategy was suboptimal or even inappropriate. We plan to investigate possible causes in the future.</Paragraph>
      <Paragraph position="7"> On the content side, the average score suggeststhe ordering of: Human a2 SEA a2 MEADa3a4a2 MEAD. As for the three individual content questions, on the recall one, both SEA and MEAD* were dominated by the Human summarizer. This indicates that both SEA and MEAD* omit some features considered important. We feel that if a longer summary was allowed, the gap between the two and the Human summarizer would be narrower. The precision question is somewhat surprising in that SEA actually performs better than the Human summarizer. In general this indicates that the feature selection strategy was quite suc- null MEAD*: Bottom line , well made camera , easy to use , very flexible and powerful features to include the ability to use external flash and lense / filters choices . 1It has a beautiful design , lots of features , very easy to use , very configurable and customizable , and the battery duration is amazing ! Great colors , pictures and white balance. The camera is a dream to operate in automode , but also gives tremendous flexibility in aperture priority , shutter priority , and manual modes . I 'd highly recommend this camera for anyone who is looking for excellent quality pictures and a combination of ease of use and the flexibility to get advanced with many options to adjust if you like. SEA: Almost all users loved the Canon G3 possibly because some users thought the physical appearance was very good. Furthermore, several users found the manual features and the special features to be very good. Also, some users liked the convenience because some users thought the battery was excellent. Finally, some users found the editing/viewing interface to be good despite the fact that several customers really disliked the viewfinder . However, there were some negative evaluations. Some customers thought the lens was poor even though some customers found the optical zoom capability to be excellent. Most customers thought the quality of the images was very good.</Paragraph>
      <Paragraph position="8"> MEAD: I am a software engineer and am very keen into technical details of everything i buy , i spend around 3 months before buying the digital camera ; and i must say , g3 worth every single cent i spent on it . I do n't write many reviews but i 'm compelled to do so with this camera . I spent a lot of time comparing different cameras , and i realized that there is not such thing as the best digital camera . I bought my canon g3 about a month ago and i have to say i am very satisfied .</Paragraph>
      <Paragraph position="9"> Human: The Canon G3 was received exceedingly well. Consumer reviews from novice photographers to semi-professional all listed an impressive number of attributes, they claim makes this camera superior in the market. Customers are pleased with the many features the camera offers, and state that the camera is easy to use and universally accessible. Picture quality, long lasting battery life, size and style were all highlighted in glowing reviews. One flaw in the camera frequently mentioned was the lens which partially obsructs the view through the view finder, however most claimed it was only a minor annoyance since they used the LCD sceen.</Paragraph>
      <Paragraph position="10">  to 5 (Strongly Agree).</Paragraph>
      <Paragraph position="11"> cessful. Finally, for the accuracy question, SEA is closer to the Human summarizer than MEAD*. In sum, recall that for evaluative text, it is very possible that different reviews express different opinions on the same question. Thus, for the summarization of evaluative text, when there is a difference in opinions, it is desirable that the summary accurately covers both angles or conveys the disagreement. On this count, according to the scores on the precision and accuracy questions, SEA appears to outperform MEAD*.</Paragraph>
    </Section>
    <Section position="3" start_page="310" end_page="311" type="sub_section">
      <SectionTitle>
5.3 Qualitative Results
</SectionTitle>
      <Paragraph position="0"> MEAD*: The most interesting aspect of the comments made by participants who evaluated MEAD*-based summaries was that they rarely criticized the summary for being nothing more than a set of extracted sentences. For example, one user claimed that the summary had a &amp;quot;simple sentence first, then ideas are fleshed out, and ends with a fun impact statement&amp;quot;. Other users, while noticing that the summary was solely quotation, still felt the summary was adequate (&amp;quot;Shouldn't just copy consumers . . . However, it summarized various aspects of the consumer's opinions . . . &amp;quot;). With regard to content, two main complaints by participants were: (i) the summary did not reflect overall opinions (e.g., included positive evaluations of the DVD player even though most evaluations were negative), and (ii) the evaluations of some features were repeated. The first complaint is consistent with the relatively low score of MEAD* on the accuracy question.</Paragraph>
      <Paragraph position="1"> Wecould address thiscomplaint by only including sentences whose CF evaluations have polarities matching the majority polarity for each CF. The second complaint could be avoided by not selecting sentences which contain evaluations of CFs already in the summary.</Paragraph>
      <Paragraph position="2"> SEA: Comments about the structure of the summaries generated bySEAmentioned the &amp;quot;coherent but robotic&amp;quot; feel of the summaries, the repetition of &amp;quot;users/customers&amp;quot; and lack of pronoun use, the lack of flow between sentences, and the repeated use of generic terms such as &amp;quot;good&amp;quot;. These problems are largely a result of simplistic microplanning and seems to contradict SEA's disappointing performance on the structure and coherence ques- null tion.</Paragraph>
      <Paragraph position="3"> In terms of content, there were two main sets of complaints. Firstly, participants wanted more &amp;quot;details&amp;quot; in the summary, for instance, they wanted examples of the &amp;quot;manual features&amp;quot; mentioned by SEA. Note that this is one complaint absent from the MEAD* summaries. That is, where the MEAD* summaries lack structure but contain detail, SEA summaries provide a general, structured overview while lacking in specifics.</Paragraph>
      <Paragraph position="4"> The other set of complaints related to the problem that participants disagreed with the choice of features in the summary. We note that this is actually a problem common to MEAD* and even the Human summarizer. The best example to illustrate this point is on the &amp;quot;physical appearance&amp;quot; of the digital camera. One reason participants may have disagreed with the summarizer's decision to include the physical appearance in the summary is that some evaluations of the physical appearance were quite subtle. For example, the sentence &amp;quot;This camera has a design flaw&amp;quot; was annotated in our corpus as evaluating the physical appearance, although not all readers would agree with that annotation. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>