<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2009">
  <Title>Analysis and modeling of manual summarization of Japanese broadcast news</Title>
  <Section position="4" start_page="0" end_page="49" type="metho">
    <SectionTitle>
* Abstractor
</SectionTitle>
    <Paragraph position="0"> The original news is written by NHK reporters, and the text is summarized by different writers, i.e., professional abstractors. Most professional abstractors are retired reporters who have expertise in writing news.</Paragraph>
    <Paragraph position="1"> * Compression rate and time allowance: The original news is compressed to a maximum length of 105 Japanese characters. We will show in section 4 that the average compression rate is about 22.5%. The upper bound is determined by the display design of the data service of digital TV broadcasting. The abstractors must work quickly because the summary news must be broadcast promptly.</Paragraph>
  </Section>
  <Section position="5" start_page="49" end_page="49" type="metho">
    <SectionTitle>
* Techniques
</SectionTitle>
    <Paragraph position="0"> The abstractors use only information contained in the original news. They scan the original news quickly and repeatedly, not to understand the full content, but to select the parts to be used in the summary news. This characteristic reading behavior of abstractors was reported by Mani (2001), and we observed the same tendency in our Japanese abstractors. The abstractors focus on the lead (the opening part) of the original news. They sometimes use the end part of the original news.</Paragraph>
  </Section>
  <Section position="6" start_page="49" end_page="11221" type="metho">
    <SectionTitle>
3 Corpus construction
</SectionTitle>
    <Paragraph position="0"> We planned the summary news corpus as a resource to investigate the manual summarization process and to look into the possibility of an automatic summarization system for broadcast news. We obtained 18,777 pieces of summary news from NHK. Although each piece is a summary of a particular original news text, the link between the summary and the original news is not available.</Paragraph>
    <Paragraph position="1"> We matched the summary and original news and constructed a corpus. There have been several attempts to construct &lt;summary text, original text&gt; corpora (Marcu, 1999; Jing and McKeown, 1999). We decided to use the method proposed by Jing and McKeown (1999) for the reasons given below.</Paragraph>
    <Paragraph position="2"> As our abstractors mentioned that they used only information available in the original news, we hypothesize that the summary and the original news share many surface words. This suggests that surface-word-based matching methods such as those of Marcu (1999) and Jing and McKeown (1999) will be effective.</Paragraph>
    <Paragraph position="3"> In particular, the word position matching realized by Jing and McKeown (1999) seems especially useful. We thought that we might be able to observe the summarization process precisely by tracing the word position links, and we adopted their method with minor modifications.</Paragraph>
    <Paragraph position="4"> As a result, our corpus takes the form of the triple: &lt;summary, original, word position correspondence&gt;.</Paragraph>
    <Section position="1" start_page="49" end_page="11221" type="sub_section">
      <SectionTitle>
3.1 Matching algorithm
</SectionTitle>
      <Paragraph position="0"> Jing and McKeown (1999) treated a word matching problem between a summary and its text, which they called the summary decomposition problem. They employed a statistical model (briefly described below) and obtained good results when they tested their method with the Ziff-Davis corpus. In the following explanation, we use the terms summary and text instead of summary news and original news for simplicity.
 (1) The word position in a summary is represented by &lt;I&gt;.</Paragraph>
      <Paragraph position="1"> (2) The word position in the text is represented by a pair of the sentence position (S) and the word position in a sentence (W) as in &lt;S, W&gt;.</Paragraph>
      <Paragraph position="2"> (3) Each summary word is checked as to whether it appears in the text. If it appears, all of its positions in the text are stored in the form of &lt;S, W&gt; to form a position trellis.
 (4) The n summary words are scanned from left to right to find the path on the trellis that maximizes the score of formula (1).</Paragraph>
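      <Paragraph position="4"> $$P(I_1, \ldots, I_n) = \prod_{i=1}^{n-1} P\bigl(I_{i+1} = \langle S_{i+1}, W_{i+1} \rangle \mid I_i = \langle S_i, W_i \rangle\bigr) \tag{1}$$ This formula is the repeated product of the probability that two adjacent words in a summary, at positions $I_i$ and $I_{i+1}$, appear at positions $\langle S_i, W_i \rangle$ and $\langle S_{i+1}, W_{i+1} \rangle$ in the text, respectively.</Paragraph>
      <Paragraph position="6"> This quantity represents the goodness of the summary and the text word matching. As a result, the path on the trellis with the maximum probability gives the overall most likely word position match.</Paragraph>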
      <Paragraph position="7"> Jing and McKeown (1999) assigned six-grade heuristic values to the probability. The highest probability of 1.0 was given when two adjacent words in a summary appear at adjacent positions in the same sentence of the text. The lowest probability of 0.5 was given when two adjacent words in a summary appear in different sentences in the text with a certain distance or greater. We fixed the distance at two sentences, considering the average sentence count of the original news texts.</Paragraph>
      <Paragraph position="8"> Their method was designed to treat a fixed summary and text pair, and it needs some modification to be applied to our two-fold problem: finding the original news of a given summary news text in a large collection of news, together with the word position matching. Their method has a special treatment for a summary word that does not appear in the text. It assumes that such a word does not exist in the summary and therefore skips the trellis at this word with a probability of 1. This unduly favors news texts that contain fewer matching words. To alleviate this problem, we assigned such a word a probability score of 0.55, which we found experimentally to work well (this score was the second smallest of the original six-grade scores). We developed a word match browser to precisely check the word matches between the summary and original news.</Paragraph>
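      <Paragraph position="9"> The procedure above can be illustrated with a minimal Python sketch of a Viterbi-style search over the position trellis. It is not the implementation used for the corpus: the function names, the data layout (a text as a list of tokenized sentences), the neutral treatment of the first summary word, and the single placeholder value 0.7 standing in for the intermediate grades are all assumptions. Only the values 1.0, 0.5 and 0.55 and the two-sentence distance come from the description above.

```python
MISSING_SCORE = 0.55        # score for a summary word absent from the text
PLACEHOLDER_MIDDLE = 0.7    # assumption: stands in for the intermediate grades


def transition_score(prev, cur):
    """Graded heuristic for two adjacent summary words placed at prev and cur."""
    (s1, w1), (s2, w2) = prev, cur
    if s1 == s2 and w2 == w1 + 1:
        return 1.0                   # adjacent positions in the same sentence
    if abs(s2 - s1) >= 2:
        return 0.5                   # two or more sentences apart
    return PLACEHOLDER_MIDDLE        # every other case (simplified)


def best_alignment(summary, text):
    """Return (score, alignment) for the highest-scoring path on the trellis.

    summary: list of words; text: list of sentences, each a list of words.
    alignment holds one (S, W) pair per summary word, or None if unmatched.
    """
    positions = {}
    for s, sentence in enumerate(text):
        for w, word in enumerate(sentence):
            positions.setdefault(word, []).append((s, w))

    # paths maps the last matched position to the best (score, alignment) so far
    paths = {None: (1.0, [])}
    for word in summary:
        candidates = positions.get(word)
        new_paths = {}
        if not candidates:
            # Unmatched word: penalize every path and keep its last position
            for last, (score, align) in paths.items():
                new_paths[last] = (score * MISSING_SCORE, align + [None])
        else:
            for cur in candidates:
                best = None
                for last, (score, align) in paths.items():
                    step = 1.0 if last is None else transition_score(last, cur)
                    cand = (score * step, align + [cur])
                    if best is None or cand[0] > best[0]:
                        best = cand
                new_paths[cur] = best
        paths = new_paths
    return max(paths.values(), key=lambda p: p[0])
```

Calling best_alignment on a candidate text returns the best path score and, for each summary word, its (S, W) position or None when the word is absent from the text. Because every candidate text is scored with the same function, the scores could in principle be compared across candidates when searching for the original news of a summary.</Paragraph>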
    </Section>
    <Section position="2" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
3.2 Summary and original news matching
</SectionTitle>
      <Paragraph position="0"> We matched 18,777 summary news texts from November 2003 to June 2004 against the news database, which mostly covers the original news of the period. We followed the procedures below.</Paragraph>
      <Paragraph position="1"> * Numerical expression normalization: Numerical expressions in the original news are written in Chinese numerals (Kanji) and those of the summary news are written in Arabic numerals. We normalized the Chinese numerals into Arabic numerals.</Paragraph>
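      <Paragraph position="2"> As an illustration of this normalization step, the following Python sketch converts runs of Kanji numerals into Arabic numerals. The paper does not describe how the normalization was implemented, so the function names, the supported characters and the regular expression are assumptions made for this example.

```python
import re

DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10 ** 4, "億": 10 ** 8}

# Assumption: any run of these characters is treated as a number
KANJI_RUN = re.compile("[〇一二三四五六七八九十百千万億]+")


def kanji_to_int(text):
    """Convert a Kanji numeral string such as 一万八千七百七十七 to an int."""
    total = 0      # value of completed groups (multiples of 万, 億)
    section = 0    # value of the current group below 10,000
    digit = 0      # pending digit or positional digits such as 二〇〇三
    for ch in text:
        if ch in DIGITS:
            digit = digit * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += ((section + digit) or 1) * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError("not a Kanji numeral: " + ch)
    return total + section + digit


def normalize_numerals(sentence):
    """Replace every run of Kanji numerals in sentence with Arabic digits."""
    return KANJI_RUN.sub(lambda m: str(kanji_to_int(m.group())), sentence)
```

A character-level rule like this also rewrites numerals that are part of ordinary words, so a practical version would probably apply it only to tokens that the morphological analyzer marks as numbers.</Paragraph>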
    </Section>
  </Section>
  <Section position="7" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
* Morphological analysis
</SectionTitle>
    <Paragraph position="0"> The summary and original news were morphologically analyzed. We used morphemes as a matching unit. In this paper, we will use morphemes and words interchangeably.</Paragraph>
  </Section>
  <Section position="8" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
* Search span
</SectionTitle>
    <Paragraph position="0"> Each summary news text was matched against the news written in the three-day period before the summary was written. This period was chosen experimentally.</Paragraph>
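    <Paragraph position="1"> The search span can be pictured as a simple date filter applied before the word matching. The sketch below is only an illustration: the function name, the (date, identifier) layout of the news database and the inclusive treatment of both ends of the three-day window are assumptions, not details given in the paper.

```python
from datetime import date


def candidates_in_span(summary_date, articles, days=3):
    """Return the ids of articles dated within `days` days before the summary.

    articles: iterable of (article_date, article_id) pairs (hypothetical layout).
    """
    selected = []
    for article_date, article_id in articles:
        age = (summary_date - article_date).days
        # Assumption: articles that are 0 to `days` days old are candidates
        if age in range(days + 1):
            selected.append(article_id)
    return selected


articles = [
    (date(2004, 1, 10), "a"),
    (date(2004, 1, 7), "b"),
    (date(2004, 1, 6), "c"),
    (date(2004, 1, 11), "d"),
]
recent = candidates_in_span(date(2004, 1, 10), articles)
```

Only the articles returned by such a filter would then be scored against the summary, which keeps the number of trellis searches per summary small.</Paragraph>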
  </Section>
  <Section position="9" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
4 Results and observation
</SectionTitle>
    <Paragraph position="0"> We checked a random sample of the news matching results and found that more than 90% were correct.</Paragraph>
    <Paragraph position="1"> Some of the summaries were exceptionally long, and we consider that such noisy data was the main reason for incorrect matching. Figure 1 shows a matching example. The sentences underlined with solid and broken lines show the word position match.</Paragraph>
    <Paragraph position="2"> The word matching is not easy to evaluate because we do not have the correct matching answer. Although there are some problems in the matching, most of the results seem to be good enough for approximate analysis. The following discussion assumes that the word matching is correct.</Paragraph>
    <Section position="1" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
4.1 Compression rate
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the basic statistics of the summary and its corresponding original news.</Paragraph>
      <Paragraph position="1">  We can see that the average compression rate is 22.5% in terms of characters. The average summary news length (109.9 characters per news text) was longer than what we were told (105, see section 2).</Paragraph>
      <Paragraph position="2"> We then checked the length of the typical summary texts. We found that the cumulative relative frequency of the summary texts with a sentence count from 1 to 4 was 0.99, so these texts are clearly dominant. We checked the average length of these summaries and obtained 105.4, which is close to what we were told. We suspect that noisy &quot;long summaries&quot; skewed the figure.</Paragraph>
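      <Paragraph position="3"> As a quick check, the reported compression rate is reproduced by dividing the average summary length by the average original length given in Table 1. Whether the figure was computed this way or as a mean of per-pair ratios is not stated, so the following is only a consistency check under that assumption.

```latex
\[
\text{compression rate} \approx \frac{109.9}{487.7} \approx 0.225 = 22.5\%
\]
```
</Paragraph>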
    </Section>
    <Section position="2" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
Original Summary
</SectionTitle>
      <Paragraph position="0"> Table 1: Basic statistics of the original and summary news (Original / Summary)
 Text count: 18,777
 Ave. sentence count per text: 5.13 / 1.63
 Ave. text length (char.): 487.7 / 109.9
 Ave. first line length (char.): 94.9 / 81.3</Paragraph>
    </Section>
    <Section position="3" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
4.2 Word match ratio
</SectionTitle>
      <Paragraph position="0"> We measured how many of the summary words came from the original news. As our matching result contains word-to-word correspondence, we calculated the ratio of the matched words in a summary text. Table 2 shows a part of the result. It shows that the relative frequency of summary news texts in which 100% of the words came from the original news reached 0.265, and that of texts in which more than 90% of the words did reached 0.970.</Paragraph>
      <Paragraph position="1"> This strongly suggests that most of the summary news texts are &quot;extracts&quot; (Mani, 2001), that is, they are written using only vocabulary appearing in the original news. This result is in accord with what the abstractors told us.</Paragraph>
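      <Paragraph position="2"> The word match ratio and the relative frequencies discussed above can be computed from the word position correspondences as in the following Python sketch. The function names and the representation of an alignment (one entry per summary word, None for an unmatched word) are assumptions, and the sample alignments are invented for illustration only.

```python
def match_ratio(alignment):
    """Fraction of summary words that are linked to a position in the text."""
    if not alignment:
        return 0.0
    return sum(pos is not None for pos in alignment) / len(alignment)


def relative_frequency(ratios, predicate):
    """Share of summaries whose match ratio satisfies predicate."""
    return sum(1 for r in ratios if predicate(r)) / len(ratios)


# Invented sample: four summaries with their word position alignments
alignments = [
    [(0, 0), (0, 1)],            # every word matched
    [(0, 0), None],              # half of the words matched
    [(0, i) for i in range(19)] + [None],   # 19 of 20 words matched
    [(1, 2)],                    # every word matched
]
ratios = [match_ratio(a) for a in alignments]
fully_matched = relative_frequency(ratios, lambda r: r == 1.0)
above_90 = relative_frequency(ratios, lambda r: r > 0.9)
```

On the real corpus, the two quantities corresponding to fully_matched and above_90 are the values 0.265 and 0.970 reported above.</Paragraph>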
    </Section>
    <Section position="4" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
4.3 Summary word employment in the original news sentences
</SectionTitle>
      <Paragraph position="0"> The previous section indicated that our summary likely belongs to the extract type. Where in the original news do these words come from? We next measured the word employment ratio of each sentence in the original news, and the result is presented in Figure 2.</Paragraph>
      <Paragraph position="1"> In this graph, the original news is categorized into five cases according to its sentence count, from 4 to 8, and the average word employment ratio is shown for each sentence.</Paragraph>
      <Paragraph position="2"> From this figure, the following observations can be made:
 * Bias toward the first sentence: In all five cases, the first sentence recorded the highest word employment ratio. The percentages of the second and third sentences increase when the news contains many sentences. The opening part of the news text is called the lead. We will discuss its role in the next section.
 * No clear favorite for the final sentence: The employment ratio did not rise for the closing sentences in any case, even though our abstractors indicated that they often use information in the last sentence. This inconsistency may be due to word matching errors. Final sentences actually have an important role in news, as we will see in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
5 Summarization model
</SectionTitle>
    <Paragraph position="0"> In the previous section, we found quite a high word overlap between a summary and the opening part of the original news text. Using our word match browser, we checked the similarity of the summary news and the lead sentences, and found that most of the summary sentences take exactly the same syntactic pattern as the opening sentence. Based on this observation and what we found in the interviews, we devised a news text summarization model. The model can explain our abstractors' behavior, and we are planning to develop an automatic or semi-automatic summarization system with it. We will explain the typical news text structure and present our model. (Footnote to section 4.3: the news texts with sentence counts from 4 to 8 cover 88% of the total news texts.)</Paragraph>
    <Section position="1" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
5.1 News text structure
</SectionTitle>
      <Paragraph position="0"> Most of our news texts are written with a three-part structure, i.e., lead, body and supplement. Figure 1 shows the two-fold structure of the lead and the body. Each part has the following characteristics.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
* Lead
</SectionTitle>
    <Paragraph position="0"> The most important information is briefly described in the opening part of a news text. This part is called the lead. Proper nouns are often avoided in favor of more abstract expressions such as &quot;a young man&quot; or &quot;a big insurance company.&quot; The lead is usually written in one or two sentences.</Paragraph>
  </Section>
  <Section position="12" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
* Body
</SectionTitle>
    <Paragraph position="0"> The lead is detailed in the body. The 5W1H information is mainly elaborated, and the proper names that were vaguely mentioned in the lead appear here. The statements of people involved in the news sometimes appear here. The repetitive structure of the lead and the body is rooted in the nature of radio news; listeners cannot go back to the previous part if they missed the information.</Paragraph>
  </Section>
  <Section position="13" start_page="11221" end_page="11221" type="metho">
    <SectionTitle>
* Supplement
</SectionTitle>
    <Paragraph position="0"> Necessary information that has not been covered in the lead and the body is placed here.</Paragraph>
    <Paragraph position="1"> Take, for example, weather news about a typhoon. A caution from the Meteorological Agency is sometimes added after the typhoon's movement has been described.</Paragraph>
    <Section position="1" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
5.2 Model
</SectionTitle>
      <Paragraph position="0"> We found that most of the summary news is written based on the lead sentences. The lead sentences are then shortened or partly modified with expressions from the body to make them more informative and self-contained.</Paragraph>
      <Paragraph position="1"> We consider that the essential operation lies in the editing of the lead sentences under the summary length constraint. Based on this observation, we propose a two-step summarization model of reading and editing. The summary in Figure 1 is constructed from the lead sentence with a phrase inserted from the body.</Paragraph>
      <Paragraph position="2"> * Reading phase
 (1) Identify the lead, the body and the supplement sentences in the original news.</Paragraph>
      <Paragraph position="3"> (2) Analysis: Find the correspondences between the parts in the lead and those in the body. We can regard this process as co-reference resolution.
 * Summary editing phase
 (3) Set the lead sentence as the base sentence of the summary.</Paragraph>
      <Paragraph position="4"> (4) Apply the following operations until the base sentence length is close enough to the predefined length N.</Paragraph>
      <Paragraph position="5"> (4-1) Delete parts in the base sentence.</Paragraph>
      <Paragraph position="6"> (4-2) Substitute parts in the base sentence with the corresponding parts in the body, using the results of (2).</Paragraph>
      <Paragraph position="7"> (4-2') Add a body part to the base sentence. We may view this as a null part substituted by a body part.</Paragraph>
      <Paragraph position="8"> (4-3) Add supplement sentences.</Paragraph>
      <Paragraph position="9"> The supplement is often included in a summary; this part contains different information from the other parts.</Paragraph>
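      <Paragraph position="10"> To make the editing phase more concrete, the following Python sketch shows a greedy simplification of steps (3), (4-2) and (4-3). It is not the model itself: the function name, the representation of a sentence as a list of phrase strings, the dictionary of lead-to-body correspondences assumed to come from step (2), and the rule of accepting an operation whenever the result still fits the length limit are all assumptions. Deletion (4-1) and insertion (4-2') are left out, and the example phrases are invented.

```python
def edit_summary(lead_parts, substitutions, supplement="", max_len=105):
    """Greedy sketch of the summary editing phase.

    lead_parts: phrases of the lead sentence, in order.
    substitutions: maps a lead phrase to its corresponding body phrase.
    supplement: optional supplement sentence.
    max_len: summary length limit in characters.
    """
    def length(parts):
        return sum(len(p) for p in parts)

    base = list(lead_parts)                       # step (3)
    for i, part in enumerate(lead_parts):         # step (4-2)
        body_part = substitutions.get(part)
        if body_part is None:
            continue
        trial = base[:i] + [body_part] + base[i + 1:]
        if max_len >= length(trial):              # keep only if it still fits
            base = trial
    # step (4-3): add the supplement when there is room for it
    if supplement and max_len >= length(base) + len(supplement):
        base.append(supplement)
    return "".join(base)


lead = ["A young man ", "was arrested."]
links = {"A young man ": "Taro Yamada, 25, "}     # invented example
summary = edit_summary(lead, links, " Police are investigating.", max_len=60)
```

With a tighter limit the same call falls back to the plain lead sentence, which mirrors the observation that the lead serves as the base of the summary.</Paragraph>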
    </Section>
    <Section position="2" start_page="11221" end_page="11221" type="sub_section">
      <SectionTitle>
5.3 Related works and discussion
</SectionTitle>
      <Paragraph position="0"> Our two-step model essentially belongs to the same category as the work of Mani et al. (1999) and Jing and McKeown (2000). Mani et al. (1999) proposed a summarization system based on &quot;draft and revision.&quot; Jing and McKeown (2000) proposed a system based on &quot;extraction and cut-and-paste generation.&quot; Our abstractors performed the same cut-and-paste operations that Jing and McKeown noted in their work, and we think that our two-step model will be a reasonable starting point for our subsequent research. Below are some of our observations.</Paragraph>
      <Paragraph position="1"> The lead sentences play a central role in our model since they serve as the base of the final summary. Their identification can be achieved with the same techniques as those used for important sentence extraction. In our case, the sentence position information plays an important role, as was shown by Kato and Uratani (2000).</Paragraph>
      <Paragraph position="2"> We consider the identification of the body and the supplement part together with the lead will be beneficial for the co-reference resolution.</Paragraph>
      <Paragraph position="3"> The co-reference resolution problem between the lead and the body should be treated in a more general way than usual. We found that our problem ranges from the word level, the correspondence between named entities and their abstract paraphrases, to the sentence level, an entire statement of a person and its short paraphrase. We are now investigating the types of co-reference that we have to cover.</Paragraph>
      <Paragraph position="4"> We found that the deletion of lead parts did not occur very often in our summary, unlike the case of Jing and McKeown (2000). One reason is that most of our leads were short enough to be included in the summary and therefore the substitution operation became conspicuous. This usually increased the length of the summary but contributed to making it more lively and informative.
 A supplement part was often included in the summary. We consider that this feature corresponds to the abstractors' comments on the employment of the final sentence, which was not clearly detected in our statistical investigation described in section 4.3. We are now investigating the conditions for including the supplement. We have so far listed the basic operations of editing through the manual checking of samples, and we are currently analyzing the operations with more examples. We will then study automatic selection of the optimum operation sequence to achieve the most informative and natural summary.</Paragraph>
    </Section>
  </Section>
</Paper>