<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2109">
  <Title>Trimming CFG Parse Trees for Sentence Compression Using Machine Learning Approaches</Title>
  <Section position="4" start_page="851" end_page="852" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="851" end_page="852" type="sub_section">
      <SectionTitle>
3.1 Maximum Entropy Model for Sentence
Compression
</SectionTitle>
      <Paragraph position="0"> We describe a maximum entropy method as a natural extension of Knight and Marcu's noisy-channel model (Knight and Marcu, 2000). Knight and Marcu's method uses only mother and daughter local relations in CFG parse trees. Therefore, it sometimes eliminates the meanings of the original sentences. For example, their method cannot distinguish never and always because these two adverbs are assigned the same non-terminals in parse trees. However, if never is removed from a sentence, the meaning of the sentence completely changes. Turner and Charniak (Turner and Charniak, 2005) revised and improved Knight and Marcu's algorithm; however, their algorithm also uses only mother and daughter relations and has the same problem. We use other information as feature functions of the maximum entropy model, and this model can deal with many features more appropriately than using simple frequency.</Paragraph>
      <Paragraph position="1"> Suppose that we trim a node in the original full parse tree. For example, suppose we have a mother node A and daughter nodes (B C D) that are derived using a CFG rule. We must leave at least one non-terminal in the daughter nodes. The trim candidates of this rule are the members of the set of subsequences, Y, of (B C D), or the seven non-terminal sequences below:</Paragraph>
      <Paragraph position="3"> probability, P(yjY) = Ptrim(A ! B CjA ! B C D), is calculated by using the maximum entropy model. We assume that these joint events are independent of each other and calculate the probability that an original sentence, l, is compressed to Description  1 the mother node 2 the current node 3 the daughter node sequence in the original sentence and which daughters are removed 4 the daughter node sequence in the compressed sentence null 5 the number of daughter nodes 6 the depth from the root 7 the daughter non-terminals that are removed 8 the daughter terminals that are removed 9 whether the daughters are negative adverbs , and removed 10 tri-gram of daughter nodes 11 only one daughter exists, and its non-terminal is the same as that of the current node 12 only one daughter exists, and its non-terminal is the same as that of the mother node 13 how many daughter nodes are removed 14 the number of terminals the current node contains 15 whether the head daughter is removed 16 the left-most and the right-most daughters 17 the left and the right siblings  s as the product of all trimming probabilities, like in Knight and Marcu's method.</Paragraph>
      <Paragraph position="5"> where R is the set of compressed and original rule pairs in joint events. Note that our model does not use Bayes' Rule or any language models.</Paragraph>
      <Paragraph position="6"> For example, in Figure 1, the trimming probability is calculated as below:</Paragraph>
      <Paragraph position="8"> To represent all summary candidates, we create a compression forest as Knight and Marcu did.</Paragraph>
      <Paragraph position="9"> We select the tree assigned the highest probability from the forest.</Paragraph>
      <Paragraph position="10"> Features in the maximum entropy model are dened for a tree node and its surroundings. When we process one node, or one non-terminal x, we call it the current node. We focus on not only x and its daughter nodes, but its mother node, its sibling nodes, terminals of its subtree and so on. The features we used are listed in Table 1.</Paragraph>
      <Paragraph position="11"> Knight and Marcu divided the log probabilities by the length of the summary. We extend this idea so that we can change the output length exibly.</Paragraph>
      <Paragraph position="12"> We introduce a length parameter, a, and de ne a score Sa as Sa(s) = length(s)a log P(sjl), where l is an input sentence to be shortened, and s is a  summary candidate. Because log P(sjl) is negative, short sentences obtain a high score for large a, and long ones get a low score. The parameter a can be negative or positive, and we can use it to control the average length of outputs.</Paragraph>
    </Section>
    <Section position="2" start_page="852" end_page="852" type="sub_section">
      <SectionTitle>
3.2 Bottom-Up Method
</SectionTitle>
      <Paragraph position="0"> As explained in Section 2.1, in Knight and Marcu's method, both original and compressed sentences are parsed, and correspondences of CFG rules are identi ed. However, when the daughter nodes of a compressed rule are not a subsequence of the daughter nodes in the original one, the method cannot learn this joint event. A complex sentence is a typical example. A complex sentence is a sentence that includes another sentence as a part. An example of a parse tree of a complex sentence and its compressed version is shown in Figure 2. When we extract joint events from these two trees, we cannot match the two root nodes because the sequence of the daughter nodes of the root node of the compressed parse tree, (NP ADVP VP .), is not a subsequence of the daughter nodes of the original parse tree, (S , NP VP .). Turner and Charniak (Turner and Charniak, 2005) solve this problem by appending special rules that are applied when a mother node and its daughter node have the same label. However, there are several types of such problems like Figure 2. We need to extract these structures from a training corpus.</Paragraph>
      <Paragraph position="1"> We propose a bottom-up method to solve the problem explained above. In our method, only original sentences are parsed, and the parse trees of compressed sentences are extracted from the original parse trees. An example of this method is shown in Figure 3. The original sentence is 'd g h f c', and its compressed sentence is 'd g c'.</Paragraph>
      <Paragraph position="2"> First, each terminal in the parse tree of the original sentence is marked if it exists in the compressed sentence. In the gure, the marked terminals are represented by circles. Second, each non-terminal in the original parse tree is marked if it has at least one marked terminal in its sub-trees. These are represented as bold boxes in the gure. If non-terminals contain marked non-terminals in their sub-trees, these non-terminals are also marked recursively. These marked non-terminals and terminals compose a tree structure like that on the right-hand side in the gure. These non-terminals represent joint events at each node.</Paragraph>
      <Paragraph position="3">  Note that this tree is not guaranteed to be a grammatical parse tree by the CFG grammar. For example, from the tree of Figure 2,</Paragraph>
      <Paragraph position="5"/>
    </Section>
  </Section>
  <Section position="5" start_page="852" end_page="855" type="metho">
    <SectionTitle>
4 Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="852" end_page="853" type="sub_section">
      <SectionTitle>
4.1 Evaluation Method
</SectionTitle>
      <Paragraph position="0"> We evaluated each sentence compression method using word F-measures, bigram F-measures, and BLEU scores (Papineni et al., 2002). BLEU scores are usually used for evaluating machine translation quality. A BLEU score is de ned as the weighted geometric average of n-gram precisions with length penalties. We used from unigram to 4-gram precisions and uniform weights for the BLEU scores.</Paragraph>
      <Paragraph position="1"> ROUGE (Lin, 2004) is a set of recall-based criteria that is mainly used for evaluating summarization tasks. ROUGE-N uses average N-gram recall, and ROUGE-1 is word recall. ROUGE-L uses the length of the longest common subsequence (LCS) of the original and summarized sentences.</Paragraph>
      <Paragraph position="2"> In our model, the length of the LCS is equal to the number of common words, and ROUGE-L is equal to the unigram F-measure because words are not rearranged. ROUGE-L and ROUGE-1 are supposed to be appropriate for the headline gener- null ation task (Lin, 2004). This is not our task, but it is the most similar task in his paper.</Paragraph>
      <Paragraph position="3"> We also evaluated the methods using human judgments. The evaluator is not the author but not a native English speaker. The judgment used the same criteria as those in Knight and Marcu's methods. We performed two experiments. In the rst experiment, evaluators scored from 1 to 5 points the grammaticality of the compressed sentence. In the second one, they scored from 1 to 5 points how well the compressed sentence contained the important words of the original one.</Paragraph>
      <Paragraph position="4"> We used the parallel corpus used in Ref. (Knight and Marcu, 2000). This corpus consists of sentence pairs extracted automatically from the Ziff-Davis corpus, a set of newspaper articles about computer products. This corpus has 1087 sentence pairs. Thirty-two of these sentences were used for the human judgments in Knight and Marcu's experiment, and the same sentences were used for our human judgments. The rest of the sentences were randomly shuf ed, and 527 sentence pairs were used as a training corpus, 263 pairs as a development corpus, and 264 pairs as a test corpus. To parse these corpora, we used Charniak and</Paragraph>
    </Section>
    <Section position="2" start_page="853" end_page="853" type="sub_section">
      <SectionTitle>
4.2 Settings of Two Experiments
</SectionTitle>
      <Paragraph position="0"> We experimented with/without goal sentence length for summaries.</Paragraph>
      <Paragraph position="1"> In the rst experiment, the system was given only a sentence and no sentence length information. The sentence compression problem without the length information is a general task, but evaluating it is dif cult because the correct length of a summary is not generally de ned even by humans.</Paragraph>
      <Paragraph position="2"> The following example shows this.</Paragraph>
      <Paragraph position="3"> Original: A font, on the other hand, is a subcategory of a typeface, such as Helvetica Bold or Helvetica Medium.</Paragraph>
      <Paragraph position="4"> Human: A font is a subcategory of a typeface, such as Helvetica Bold.</Paragraph>
      <Paragraph position="5"> System: A font is a subcategory of a typeface.</Paragraph>
      <Paragraph position="6"> The such as phrase is removed in this system output, but it is not removed in the human summary. Neither result is wrong, but in such situations, the evaluation score of the system decreases. This is because the compression rate of each algorithm is different, and evaluation scores are affected by the lengths of system outputs. For this reason, results with different lengths cannot be  compared easily. We therefore examined the relations between the average compression ratios and evaluation scores for all methods by changing the system summary length with the different length parameter a introduced in Section 3.1.</Paragraph>
      <Paragraph position="7"> In the second experiment, the system was given a sentence and the length for the compressed sentence. We compressed each input sentence to the length of the sentence in its goal summary. This sentence compression problem is easier than that in which the system can generate sentences of any length. We selected the highest-scored sentence from the sentences of length l. Note that the recalls, precisions and F-measures have the same scores in this setting.</Paragraph>
    </Section>
    <Section position="3" start_page="853" end_page="854" type="sub_section">
      <SectionTitle>
4.3 Results of Experiments
</SectionTitle>
      <Paragraph position="0"> The results of the experiment without the sentence length information are shown in Figure 4, 5 and 6. Noisy-channel indicates the results of the noisy-channel model, ME indicates the results of the maximum-entropy method, and ME + bottom-up indicates the results of the maximum-entropy  method with the bottom-up method. We used the length parameter, a, introduced in Section 3.1, and obtained a set of summaries with different average lengths. We plotted the compression ratios and three scores in the gures. In these gures, a compression ratio is the ratio of the total number of words in compressed sentences to the total number of words in the original sentences.</Paragraph>
      <Paragraph position="1"> In these gures, our maximum entropy methods obtained higher scores than the noisy-channel model at all compression ratios. The maximum entropy method with the bottom-up method obtain the highest scores on these three measures.</Paragraph>
      <Paragraph position="2"> The results of the experiment with the sentence length information are shown in Figure 7. In this experiment, the scores of the maximum entropy methods were higher than the scores of the noisy-channel model. The maximum entropy method with the bottom-up method achieved the highest scores on each measure.</Paragraph>
      <Paragraph position="3"> The results of the human judgments are shown in Table 2. In this experiment, each length of output is same as the length of goal sentence. The  maximum entropy with the bottom-up method obtained the highest scores of the three methods. We did t-tests (5% signi cance). Between the noisy-channel model and the maximum entropy with the bottom-up method, importance is signi cantly different but grammaticality is not. Between the human and the maximum entropy with the bottom-up method, grammaticality is signi cantly different but importance is not. There are no signi cant differences between the noisy-channel model and the maximum entropy model.</Paragraph>
      <Paragraph position="4">  One problem of the noisy-channel model is that it cannot distinguish the meanings of removed words. That is, it sometimes removes semantically important words, such as not and never , because the expansion probability depends only on non-terminals of parent and daughter nodes.</Paragraph>
      <Paragraph position="5"> For example, our test corpus includes 15 sentences that contain not . The noisy-channel model removed six not s, and the meanings of the sentences were reversed. However, the two maximum entropy methods removed only one not because they have negative adverb as a feature in their models. The rst example in Table 3 shows one of these sentences. In this example, only Noisy-channel removed not .</Paragraph>
    </Section>
    <Section position="4" start_page="854" end_page="855" type="sub_section">
      <SectionTitle>
4.3.2 Effect of Bottom-Up Method
</SectionTitle>
      <Paragraph position="0"> Our bottom-up method achieved the highest accuracy, in terms of F-measures, bigram Fmeasures, BLEU scores and human judgments.</Paragraph>
      <Paragraph position="1"> The results were fairly good, especially when it summarized complex sentences, which have sentences as parts. The second example in Table 3 is a typical complex sentence. In this example, only ME + bottom-up correctly remove he said .</Paragraph>
      <Paragraph position="2"> Most of the complex sentences were correctly compressed by the bottom-up method, but a few sentences like the third example in Table 3 were not. In this example, the original sentence was parsed as shown in Figure 8 (left). If this sentence is compressed to the human output, its parse tree has to be like that in Figure 8 (middle) using  Original a file or application '' alias '' similar in effect to the ms-dos path statement provides a visible icon in folders where an aliased application does not actually reside .</Paragraph>
      <Paragraph position="3"> Human a file or application alias provides a visible icon in folders where an aliased application does not actually reside .</Paragraph>
      <Paragraph position="4">  Noisy-channel null a similar in effect to ms-dos statement provides a visible icon in folders where an aliased application does reside .</Paragraph>
      <Paragraph position="5"> ME a or application alias statement provides a visible icon in folders where an aliased application does not actually reside .</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="855" end_page="856" type="metho">
    <SectionTitle>
ME +
</SectionTitle>
    <Paragraph position="0"> bottom-up a file or application statement provides a visible icon in folders where an aliased application does not actually reside .</Paragraph>
    <Paragraph position="1"> Original the user can then abort the transmission , he said .</Paragraph>
    <Paragraph position="2"> Human the user can then abort the  our method. When a parse tree is too long from the root to the leaves like this, some nodes are trimmed but others are not because we assume that each trimming probability is independent. The compressed sentence is ungrammatical, as in the third example in Table 3.</Paragraph>
    <Paragraph position="3"> We have to constrain such ungrammatical sentences or introduce another rule that reconstructs a short tree as in Figure 8 (right). That is, we introduce a new transformation rule that compresses</Paragraph>
    <Paragraph position="5"/>
    <Section position="1" start_page="855" end_page="855" type="sub_section">
      <SectionTitle>
4.4 Comparison with Original Results
</SectionTitle>
      <Paragraph position="0"> We compared our results with Knight and Marcu's original results. They implemented two methods: one is the noisy-channel model and the other is a decision-based model. Each model produced 32 compressed sentences, and we calculated Fmeasures, bigram F-measures, and BLEU scores.</Paragraph>
      <Paragraph position="1"> We used the length parameter a = 0.5 for the maximum-entropy method and a = 0.25 for  the maximum-entropy method with the bottom-up method. These two values were determined using experiments on the development set, which did not contain the 32 test sentences.</Paragraph>
      <Paragraph position="2"> The results are shown in Table 4. Noisy-channel indicates the results of Knight and Marcu's noisy-channel model, and Decision-based indicates the results of Knight and Marcu's decision-based model. Comp. indicates the compression ratio of each result. Our two methods achieved higher accuracy than the noisy-channel model. The results of the decision-based model and our maximum-entropy method were not signi cantly different. Our maximum-entropy method with the bottom-up method achieved the highest accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="855" end_page="856" type="sub_section">
      <SectionTitle>
4.5 Corpus Size and Output Accuracy
</SectionTitle>
      <Paragraph position="0"> In general, using more training data improves the accuracy of outputs and using less data results in low accuracy. Our experiment has the problem that the training corpus was small. To study the relation between training corpus size and accuracy, we experimented using different training corpus sizes and compared accuracy of the output.</Paragraph>
      <Paragraph position="1"> Figure 9 shows the relations between training corpus size and three scores, F-measures, bigram F-measures and BLEU scores, when we used the maximum entropy method with the bottom-up method. This graph suggests that the accuracy in- null and evaluation score.</Paragraph>
      <Paragraph position="2"> creases when the corpus size is increased. Over about 600 sentences, the increase becomes slower. The graph shows that the training corpus was large enough for this study. However, if we introduced other speci c features, such as lexical features, a larger corpus would be required.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML