<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1005">
  <Title>Vocabulary Usage in Newswire Summaries</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Corpus
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 General Organization
</SectionTitle>
      <Paragraph position="0"> The authors have assembled a corpus of manually written summaries of texts from their archive of materials provided to participants in the DUC conferences, held annually since 2001. It is available at the DUC Web site to readers who are qualified to access the DUC document sets on application to NIST. To help interested parties assess it for their purposes we provide more detail than usual on its organization and contents.</Paragraph>
      <Paragraph position="1"> Most summaries in the corpus are abstracts, written by human readers of the source document to best express its content without restriction in any manner save length (words or characters). One method of performing automatic summarization is to construct the desired amount of output by concatenating representative sentences from the source document, which reduces the task to one of determining most adequately what 'representative' means. Such summaries are called extracts. In 2002, recognizing that many participants summarize by extraction, NIST produced versions of documents divided into individual sentences and asked its author volu nteers to compose their summaries similarly. Because we use a sentence-extraction technique in our summarization system, this data is of partic ular interest to us. It is not included in the corpus being treated here and will be discussed in a separate paper.</Paragraph>
      <Paragraph position="2"> The DUC corpus contains 11,867 files organized in a three-level hierarchy of directories totalling 62MB. The top level identifies the source year and exists simply to avoid the name collision which occurs when different years use same-named subdirectories. The middle 291 directories identify the document clusters; DUC reuses collections of newswire stories assembled for the TREC and TDT research in itiatives which report on a common topic or theme. Directories on the lowest level contain SGML-tagged and untagged versions of 2,781 individual source documents, and between one and five summaries of each, 9,086 summaries in total. In most cases the document involved is just that: a single news report originally published in a newspaper.</Paragraph>
      <Paragraph position="3"> 552 directories, approximately 20% of the corpus, represent multi-document summaries--ones which the author has based on all the files in a cluster of related documents. For these summaries a source document against which to compare them has been constructed by concatenating the individual documents in a cluster into one file. Concatenation is done in directory order, though the order of documents does not matter here.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Corpus in Detail
</SectionTitle>
      <Paragraph position="0"> The Document Understanding Conference has evolved over the four years represented in our corpus, and this is reflected in the materials which are available for our purposes. Table 1 classifies these files by year and by target size of summary; the rightmost column indicates the ratio of summaries to source documents, that is, the average number of summaries per document. Totals appear in bold. The following factors of interest can be identified in its data:
* Size. Initially DUC targeted summaries of 50, 100 and 200 words. The following year 10-word summaries were added, and in 2003 only 10- and 100-word summaries were produced;
* Growth. Despite the high cost of producing manual summaries, the number of documents under consideration has doubled over the four years under study while the number of summaries has tripled;
* Ratio. On average, three manual summaries are available for each source document;
* Formation. While longer summaries are routinely composed of well-formed sentences, sub-sentential constructs such as headlines are acceptable 10-word summaries, as are lists of key words and phrases.</Paragraph>
      <Paragraph position="1"> * Author. Although the 2004 DUC source documents include machine translations of foreign language news stories, in each case a parallel human translation was available. Only source documents written or translated by human beings appear in the corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Evaluation Model
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the typical contents of a third-level source document directory. Relations we wish to investigate are marked by arrows. There are two: the relationship between the vocabulary used in the source document and summaries of it, and that among the vocabulary used in summaries themselves. The first is marked by white arrows, the second by grey.</Paragraph>
    <Paragraph position="1"> The number of document-summary relations in the corpus is determined by the larger cardinality set involved, which here is the number of summaries: thus 9,086 instances. For every document with N summaries, we consider all C(N, 2) pairs of summaries. In total there are 11,441 summary-summary relatio nships.</Paragraph>
    <Paragraph position="2"> We ask two questions: to what degree do summaries use words appearing in the source document? and, to what degree do different summaries use the same vocabulary?</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Measures
</SectionTitle>
      <Paragraph position="0"> To answer our two questions we decided to compute statistics on two types of elements of each pair of test documents: their phrases, and ultimately, their  individual tokens. Phrases were extracted by applying a 987-item stop list developed by the authors (Copeck and Szpakowicz 2003) to the test documents. Each collocation separated by stop words is taken as a phrase1. Test documents were tokenized by breaking the text on white space and trimming off punctuation external to the token. Instances of each sort of item were recorded in a hash table and written to file.</Paragraph>
      <Paragraph position="1"> Tokens are an obvious and unambiguous baseline for lexical agreement, one used by such summary evaluation systems as ROUGE (Lin and Hovy, 2003). On the other hand, it is important to explain what we mean by units we call phrases; they should not be confused with syntactically correct constituents such as noun phrases or verb phrases. Our units often are not syntactically well-formed. Adjacent constituents not separated by a stop word are unified, single constituents are divided on any embedded stop word, and those composed entirely of stop words are simply missed.</Paragraph>
      <Paragraph position="2"> Our phrases, however, are not n-grams. A 10-word summary has precisely 9 bigrams but, in this study, only 3.4 phrases on average (Table 2). On the continuum of grammatic ality these units can thus be seen as lying somewhere between generated blindly n-grams and syntactically well-formed phrasal constituents. We judge them to be weakly syntactically motivated2 and only roughly analogous to the factoids identified by van Halteren and Teufel (2003) in the sense that they also express semantic constructs. Where van Halteren and Teufel identified factoids in 50 summaries, we sacrificed accuracy for automation in order to process 9000.</Paragraph>
      <Paragraph position="3"> We then assessed the degree to which a pair of documents for comparison shared vocabulary in terms of these units. This was done by counting matches between the phrases. Six different kinds of match were identified and are listed here in what we deem to be decreasing order of stringency. While the match types are labelled and described in terms of summary and source document for clarity, they apply equally to summary pairs. Candidate phrases are underlined and matching elements tinted in the examples; headings used in the results table (Table 2) appear in SMALL CAPS.</Paragraph>
      <Paragraph position="4"> 1 When analysis of a summary indicated that it was a list of comma- or semicolon-delimited phrases, the phrasing provided by the summary author was adopted, including any stopwords present. Turkey attacks Kurds in Iraq, warns Syria, accusations fuel tensions, Mubarak intercedes is thus split into four phrases with the first retaining the stopword in. There are 453 such summaries.</Paragraph>
      <Paragraph position="5"> 2 While the lexical units in question might be more accurately labelled syntactically motivated ngrams, for simplicity we use phrase in the discussion.</Paragraph>
      <Paragraph position="6">  * Exact match. The most demanding, requires candidates agree in all respects. EXACT after Mayo Clinic stay -Mayo Clinic group * Case-insensitive exact match relaxes the re- null quirement for agreement in case. EXACT CI concerning bilateral relations Bilateral relations with * Head of summary phrase in document requires only that the head of the candidate appear in the source document phrase. The head is the rightmost word in a phrase. HEAD DOC calls Sharon disaster deemed tantamount to disaster * Head of document phrase in summary is the previous test in reverse. HEAD SUM * Summary phrase is substring of document phrase. True if the summary phrase appears anywhere in the document phrase. SUB DOC has identified Iraqi agent as the Iraqi agent defection * Document phrase is substring of summary phrase reverses the previous test. SUB SUM Tests for matches between the tokens of two documents are more limited because only single lexical items are involved. Exact match can be supplemented by case insensitivity and by stemming to identify any common root shared by two tokens.</Paragraph>
      <Paragraph position="7"> The Porter stemmer was used.</Paragraph>
      <Paragraph position="8"> The objective of all these tests is to capture any sort of meaningful resemblance between the vocabularies employed in two texts. Without question, additional measures can and should be identified and implemented to correct, expand, and refine the analysis.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Methodology
</SectionTitle>
      <Paragraph position="0"> The study was carried out in three stages. A pre-study determined the &amp;quot;lie of the land&amp;quot;--what the general character of results was likely to be, the most appropriate methodology to realize them, and so on. In particular this initial investigation alerted us to the fact that so few phrases in any two texts under study matched exactly as to provide little useful data, leading us to add more relaxed measures of lexical agreement. This in itial investigation made it clear that there was no point in attempting to find a subset of vocabulary used in a number of summaries--it would be vanishingly small--and we therefore confined ourselves in the main study to pairwise comparisons. The pre-study also suggested that summary size would be a significant factor in lexical agreement while source document size would be less so, indications which were not entirely borne out by the strength of the results ult imately observed.</Paragraph>
      <Paragraph position="1"> The main study proceeded in two phases. After the corpus had been organized as described in Section 2 and untagged versions of the source documents produced for the analysis program to work on, that process traversed the directory tree, decomposing each text file into its phrases and tokens.</Paragraph>
      <Paragraph position="2"> These were stored in hash tables and written to file to provide an audit point on the process. The hash tables were then used to test each pair of test documents for matches--the source document to each summary, and all combinations of summarie s. The resulting counts for all comparisons together with other data were then written to a file with results, one line per source document in a comma-delimited format suitable for importation to a spreadsheet program. null The second phase of the main study involved organizing the spreadsheet data into a format permitting the calculation of statistics on various categorizations of documents they describe. Because the source document record was variable -length in itself and also contained a varying number of variable -length sub-records of document pair comparisons, this was a fairly time-consuming clerical task. It did however provide the counts and averages presented in Table 2 and subsequently allowed the user to recategorize the data fairly easily.</Paragraph>
      <Paragraph position="3"> A post-study was then conducted to validate the computation of measures by reporting these to the user for individual document sets, and applied to a  small random sample of text pairs. Figure 2 shows the comparison of two summaries of source document AFA19981230.1000.0058. A secondary objective of the post-study was to inspect the actual data. Were there factors in play in the data that had escaped us? None were made evident beyond the all-too-familiar demonstration of the wide variety of language use in play. The log file of document phrase hash tables provided an additional snapshot of the kind of materials with which the automated computation had been working. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data Averages
</SectionTitle>
      <Paragraph position="0"> Table 2 illustrates the degree to which summaries in the DUC corpus employ the same vocabulary as the source documents on which they are based and the degree to which they resemble each other in wording. The table, actually a stack of four tables which share common headings, presents data on the document-summary relationship followed by inter-summary data, giving counts and then percentages for each relationship. Statistics on the given relationship appear in the first three columns on the left; counts and averages are classified by summary size. The central group of six columns presents from left to right, in decreasing order of strictness, the average number of phrase matches found for the size category. The final two columns on the right present parallel match data for tokens. Thus for example the column entitled STEM CI shows the average number of stemmed, case-insensitive token matches in a pair of test documents of the size category indicated. Each table in the stack ends with a boldface row that averages statistics across all size categories.</Paragraph>
      <Paragraph position="1"> Inspection of the results in Table 2 leads to these general observations: * With the exception of 200-word summaries falling somewhat short (157 words), each category approaches its target size quite closely;  * Phrases average three tokens in length regardless of summary size; * The objective of relaxing match criteria in the main study was achieved. With few exceptions, each less strict match type produces more hits than its more stringent neighbors; * The much smaller size of the now discontinued 50- and 200-word categories argues against investing much confidence in their data; * Finally, while no effect was found for source  document size (and results for that categorization are therefore not presented), the percentage tables suggest summary size has some limited impact on vocabulary agreement. This effect occurs solely on the phrasal level, most strongly on its strictest measures; token values are effectively flat.</Paragraph>
      <Paragraph position="2"> We are uncertain why this last situation is so.</Paragraph>
      <Paragraph position="3"> Consider only the well-populated 10-word and 100-word summary classes. The effect cannot be accounted for a preponderance of multiple document summaries in either class which might provide more opportunities for matches. Despite many more of these being among the 100-word summaries than the 10-word (1974 single : 1056 multi, versus 116 single : 5456 multi), the percentage of exact phrasal matches is essentially the same in each subcategorization of these classes.</Paragraph>
      <Paragraph position="4"> We speculate that authors may compose the sentences in 100-word summaries in terms of phrases from the source document, while 10-word summaries, which more closely resemble terse headlines, cannot be composed by direct reuse of source document phrases. 50- and 200-word summaries are also composed of sentences. Their exact match percentages approach those of 100-word summaries, lending support to this interpre-</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Data Variance
</SectionTitle>
      <Paragraph position="0"> Whether count or percentage, exclusively average data is presented in Table 2. While measures of central tendency are an important dimension of any population, a full statistical description also requires some indication of measures of variance.</Paragraph>
      <Paragraph position="1"> These appear in Figure 3 which shows, for each of the six phrasal and two token measures, what percentage of the total number of summaries falls into each tenth of the range of possible values. For example, a summary in which 40% of the phrases were exactly matched in the source document would be represented in the figure by the vertical position of the frontmost band over the extent of the decade labeled '4'--24%. The figure's three-dimensional aspect allows the viewer to track which decades have the greatest number of instances as measures move from more strict to more relaxed, front to back.</Paragraph>
      <Paragraph position="2"> However, the most striking message communicated by Figure 3 is that large numbers of summaries have zero values for the stricter measures, EXACT, EXACT CI and PART SUM in particular and PART DOC to a lesser degree. These same measures have their most frequent values around the 50% decade, with troughs both before and after.</Paragraph>
      <Paragraph position="3"> To understand why this is so requires some explanation. Suppose a summary contains two phrases.</Paragraph>
      <Paragraph position="4"> If none are matched in the source its score is 0%.</Paragraph>
      <Paragraph position="5"> If one is matched its score is 50%; if both, 100%.</Paragraph>
      <Paragraph position="6"> A summary with three phrases has four possible percentage values: 0%, 33%, 66% and 100%. The 'hump' of partial matching is thus around the fifty percent level because most summaries are ten words, and have only 1 or 2 candidates to be matched. The ranges involved in the stricter measures are not large.</Paragraph>
      <Paragraph position="7"> That acknowledged, we can see that the modal or most frequent decade does indeed tend in an irregular way to move from left to right, from zero to 100 percent, as measures become less strict. In making this observation, note that the two backmost bands represent measures on tokens, a different syntactic element than the phrase. The information about the distribution of summary measures shown in this figure is not unexpected.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Key Findings
</SectionTitle>
      <Paragraph position="0"> The central fact that these data communicate quite clearly is that summaries do not employ many of the same phrases their source documents do, and even fewer than do other summaries. In particular, on average only 37% of summary phrases appear in the source document, while summaries share only 9% of their phrases. This becomes more understandable when we note that on average only 55% of the individual words used in summaries, both common vocabulary terms and proper names, appear in the source document; and between summaries, on average only 22% are found in both.</Paragraph>
      <Paragraph position="1"> It may be argued that the lower counts for inter-summary vocabula ry agreement can be explained thus: since a summary is so much smaller than its source document, lower counts should result. One reply to that argument is that, while acknowledging that synonymy, generalization and specialization would augment the values found, the essence of a generic summary is to report the pith, the gist, the central points, of a document and that these key elements should not vary so widely from one summary to the next.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Pertinent Research
</SectionTitle>
    <Paragraph position="0"> Previous research addressing summary vocabulary is limited, and most has been undertaken in connection with another issue: either with the problem of evaluating summary quality (Mani, 2001; Lin and Hovy, 2002) or to assess sentence element suitability for use in a summary (Jing and McKeown, 1999). In such a case results arise as a by-product of the main line of research and conclusions about vocabulary must be inferred from other findings.</Paragraph>
    <Paragraph position="1"> Mani (2001) reports that &amp;quot;previous studies, most of which have focused on extracts, have shown evidence of low agreement among humans as to which sentences are good summary sentences.&amp;quot; Lin and Hovy's (2002) discovery of low inter-rater agreement in single (~40%) and multiple (~29%) summary evaluation may also pertain to our findings. It stands to reason that individua ls who disagree on sentence pertinence or do not rate the same summary highly are not likely to use the same words to write the summary. In the very overt rating situation they describe, Lin and Hovy were also able to identify human error and quantify it as a significant factor in rater performance. This reality may introduce variance as a consequence of suboptimal performance: a writer may simply fail to use the mot juste .</Paragraph>
    <Paragraph position="2"> In contrast, Jing, McKeown, Barzilay and Elhadad (1998) found human summarizers to be 'quite consistent' as to what should be included, a result they acknowledge to be 'surprisingly high'. Jing et al. note that agreement drops off with summary length, that their experience is somewhat at variance with that of other researchers, and that this may be accounted for in part by regularity in the structure of the documents summarized.</Paragraph>
    <Paragraph position="3"> Observing that &amp;quot;expert summarizers often reuse the text in the original document to produce a summary&amp;quot; Jing and McKeown (1999) analyzed 300 human written summaries of news articles and found that &amp;quot;a significant portion (78%) of summary sentences produced by humans are based on cut-and-paste&amp;quot;, where 'cut-and-paste' indicates vocabulary agreement. This suggests that 22% of summary sentences are not produced in this way; and the authors report that 315 (19%) sentences do not match any sentence in the document. null In their 2002 paper, Lin and Hovy examine the use of multiple gold standard summaries for summarization evaluation, and conclude &amp;quot;we need more than one model summary although we cannot estimate how many model summaries are required to achieve reliable automated summary evaluation&amp;quot;.</Paragraph>
    <Paragraph position="4"> Attempting to answer that question, van Halteren and Teufel (2003) conclude that 30 to 40 manual summaries should be sufficient to establish a stable consensus model summary. Their research, which directly explores the differences and similarities between various human summaries to establish a basis for such an estimate, finds great variation in summary content as reflected in factoids3. This variation does not fall off with the number of summaries and accordingly no two summaries correlate highly. Although factoid measures did not correlate highly with those of unigrams (tokens), the former did clearly demonstrate an importance hierarchy which is an essential condition if a consensus model summary is to be constructed. Our work can thus be seen as confirming that, in large measure, van Halteren and Teufel's findings apply to the DUC corpus of manual summaries.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> We began this study to test two hypotheses. The first is this: automatic summarization is made difficult to the degree that manually-written summaries do not limit themselves to the vocabulary of the source document. For a summarization system 3 A factoid is an atomic semantic unit corresponding to an expression in first-order predicate logic. As already noted we approximate phrases to factoids.</Paragraph>
    <Paragraph position="1"> to incorporate words which do not appear in the source document requires at a minimum that it has a capacity to substitute a synonym of some word in the text, and some justification for doing so.</Paragraph>
    <Paragraph position="2"> More likely it would involve constructing a representation of the text's meaning and reasoning (generalization, inferencing) on the content of that representation. The latter are extremely hard tasks.</Paragraph>
    <Paragraph position="3"> Our second hypothesis is that automatic summarization is made difficult to the degree that manually written summaries do not agree among themselves. While the variety of possible disagreements are multifarious, the use of different vocabulary is a fundamental measure of semantic heterogeneity. Authors cannot easily talk of the same things if they do not use words in common.</Paragraph>
    <Paragraph position="4"> Unfortunately, our study of the DUC manual summaries and their source documents provides substantial evidence that summarization of these documents remains difficult indeed.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML