<?xml version="1.0" standalone="yes"?> <Paper uid="J02-4006"> <Title>Using Hidden Markov Modeling to Decompose Human-Written Summaries</Title> <Section position="4" start_page="530" end_page="531" type="intro"> <SectionTitle> 2 It is, of course, possible that a summary sentence has not been constructed by cut and paste even if </SectionTitle> <Paragraph position="0"> more than half of the words in the sentence are from the original document.</Paragraph> <Section position="1" start_page="531" end_page="531" type="sub_section"> <SectionTitle> Jing Decomposing Human-Written Summaries 3.1 Formulating the Problem </SectionTitle> <Paragraph position="0"> We first mathematically formulate the summary sentence decomposition problem. An input summary sentence can be represented as a word sequence: (I</Paragraph> <Paragraph position="2"> is the first word of the sentence and I N is the last word. The position of a word in a document can be uniquely represented by the sentence position and the word position within the sentence: (SNUM, WNUM). For example, (4, 8) uniquely refers to the eighth word in the fourth sentence. Multiple occurrences of a word in the document can be represented by a set of word positions: {(SNUM )} for each word in the sequence, determine the most likely document position for each word.</Paragraph> <Paragraph position="3"> Through this formulation, we transform the difficult tasks of identifying phrase boundaries and determining phrase origins into the problem of finding a most likely document position for each word. As shown in Figure 1, when a position has been chosen for each word in the summary sequence, we obtain a sequence of positions. For example, ((0,21), (2,40), (2,41), (0,31)) is our position sequence when the first occurrence of the same word in the document has been chosen for every summary word; ((0,26), (2,40), (2,41), (0,31)) is another position sequence. Every time a different position is chosen for a summary word, we obtain a different position sequence. The word the in the sequence occurs 44 times in the document, communication occurs once, subcommittee occurs twice, and of occurs 22 times. This four-word sequence therefore has a total of 1,936 (44 x 1 x 2 x 22) possible position sequences.</Paragraph> <Paragraph position="4"> Morphological analysis or stemming can be performed to associate morphologically related words, but it is optional. In our experiments, applying stemming improved system performance when the human-written summaries included many words that were morphological variants of original-document words. Many human-written summaries in our experiments, however, contained few cases of morphological transformation of words and phrases borrowed from original documents, so stemming did not improve the performance for these summaries.</Paragraph> <Paragraph position="5"> Finding the most likely document position for each word is equivalent to finding the most likely position sequence among all possible position sequences. For the example in Figure 1, the most likely position sequence should be ((2,39), (2,40), (2,41), (2,42)); that is, the fragment comes from document sentence 2 and its position within the sentence is word number 39 to word number 42. 
</Section> <Section position="2" start_page="531" end_page="531" type="sub_section"> <SectionTitle>3.2 The Hidden Markov Model</SectionTitle> <Paragraph position="0"> The exact document position from which a word in a summary comes depends on the positions of the words surrounding it. Using the bigram model, we assume that the probability of a word's coming from a certain position in the document depends only on the word directly before it in the sequence. Suppose I_(i+1) comes from sentence number S_(i+1) and word number W_(i+1) of the document when I_i comes from sentence number S_i and word number W_i; the model must then estimate the transition probability P((S_(i+1), W_(i+1)) | (S_i, W_i)).</Paragraph> <Paragraph position="1"> To decompose a summary sentence, we must consider how humans are likely to generate it; we draw here on the revision operations discussed in section 2. Two</Paragraph>
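<Paragraph position="2"> As a minimal sketch of how the most likely position sequence can be found under the bigram assumption, the following Python Viterbi search (ours, not the paper's implementation) scores every candidate position of each summary word against the best-scoring candidates of the previous word. The paper derives its transition probabilities from the revision operations of section 2; adjacency_prob below is a simplified stand-in that merely rewards consecutive positions.

def viterbi_positions(summary_words, index, trans_prob):
    # `index` maps each word to its candidate (SNUM, WNUM) positions
    # (see build_position_index above); every summary word is assumed
    # to occur somewhere in the document.
    words = [w.lower() for w in summary_words]
    # best[pos] = (score, path): best path for the prefix ending at pos.
    best = {pos: (1.0, [pos]) for pos in index[words[0]]}
    for word in words[1:]:
        new_best = {}
        for cur in index[word]:
            prev, (score, path) = max(
                best.items(), key=lambda kv: kv[1][0] * trans_prob(kv[0], cur)
            )
            new_best[cur] = (score * trans_prob(prev, cur), path + [cur])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

def adjacency_prob(prev, cur):
    # Toy bigram transition: strongly favor the next word of the same
    # sentence, heavily penalize any other jump.
    return 0.9 if cur == (prev[0], prev[1] + 1) else 0.001

With the index of section 3.1 and this toy heuristic, the search returns ((2,39), (2,40), (2,41), (2,42)) for the four-word example, since the run of consecutive positions in sentence 2 outscores the other 1,935 candidates.</Paragraph> </Section> </Section> </Paper>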