<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2109">
  <Title>Trimming CFG Parse Trees for Sentence Compression Using Machine Learning Approaches</Title>
  <Section position="3" start_page="850" end_page="851" type="intro">
    <SectionTitle>
2 Background
2.1 The Noisy-Channel Model for Sentence
Compression
</SectionTitle>
    <Paragraph position="0"> Knight and Marcu proposed a sentence compression method using a noisy-channel model (Knight and Marcu, 2000). This model assumes that a long sentence was originally a short one and that the longer sentence was generated because some unnecessary words were added. Given a long sentence l, it nds a short sentence s that maximizes P(sjl). This is equivalent to nding the s that maximizes P(s) P(ljs) in Bayes' Rule.</Paragraph>
    <Paragraph position="1"> The expression P(s) is the source model, which gives the probability that s is the original short string. When s is ungrammatical, P(s) becomes small. The expression P(ljs) is the channel model, which gives the probability that s is expanded to l. When s does not include important words of l, P(ljs) has a low value.</Paragraph>
    <Paragraph position="2"> In the Knight and Marcu's model, a probabilistic context-free grammar (PCFG) score and a word-bigram score are incorporated as the source model. To estimate the channel model, Knight and Marcu used the Ziff-Davis parallel corpus, which contains long sentences and corresponding short sentences compressed by humans. Note that each compressed sentence is a subsequence of the corresponding original sentence. They rst parse both the original and compressed sentences using a CFG parser to create parse trees. When two nodes of the original and compressed trees have the same non-terminals, and the daughter nodes of the compressed tree are a subsequence of the original tree, they count the node pair as a joint event. For example, in Figure 1, the original parse tree contains a rule rl = (B ! D E F), and the compressed parse tree contains rs = (B ! D F).</Paragraph>
    <Paragraph position="3"> They assume that rs was expanded into rl, and count the node pairs as joint events. The expansion probability of two rules is given by:</Paragraph>
    <Paragraph position="5"> Finally, new subtrees grow from new daughter nodes in each expanded node. In Figure 1, (E (G g) (H h)) grows from E. The PCFG scores, Pcfg, of these subtrees are calculated.</Paragraph>
    <Paragraph position="6"> Then, each probability is assumed to be independent of the others, and the channel model, P(ljs), is calculated as the product of all expansion probabilities of joint events and PCFG scores of new</Paragraph>
    <Paragraph position="8"> where R is the set of rule pairs, and Rprime is the set of generation rules in new subtrees.</Paragraph>
    <Paragraph position="9"> To compress an input sentence, they create a tree with the highest score of all possible trees.</Paragraph>
    <Paragraph position="10"> They pack all possible trees in a shared-forest structure (Langkilde, 2000). The forest structure is represented by an AND-OR tree, and it contains many tree structures. The forest representation saves memory and makes calculation faster because the trees share sub structures, and this can reduce the total number of calculations.</Paragraph>
    <Paragraph position="11"> They normalize each log probability using the length of the compressed sentence; that is, they divide the log probability by the length of the compressed sentence.</Paragraph>
    <Paragraph position="12"> Turner and Charniak (Turner and Charniak, 2005) added some special rules and applied this method to unsupervised learning to overcome the lack of training data. However their model also has the same problem. McDonald (McDonald, 2006) independently proposed a new machine learning approach. He does not trim input parse trees but uses rich features about syntactic trees and improved performance.</Paragraph>
    <Section position="1" start_page="850" end_page="851" type="sub_section">
      <SectionTitle>
2.2 Maximum Entropy Model
</SectionTitle>
      <Paragraph position="0"> The maximum entropy model (Berger et al., 1996) estimates a probability distribution from training data. The model creates the most uniform distribution within the constraints given by users. The distribution with the maximum entropy is considered the most uniform.</Paragraph>
      <Paragraph position="1"> Given two nite sets of event variables, X and Y, we estimate their joint probability distribution, P(x,y). An output, y (2 Y), is produced, and  contextual information, x (2 X), is observed. To represent whether the event (x,y) satis es a certain feature, we introduce a feature function. A feature function fi returns 1 iff the event (x,y) satis es the feature i and returns 0 otherwise.</Paragraph>
      <Paragraph position="2"> Given training data f(x1,y1), , (xn,yn)g, we assume that the expectation of fi on the distribution of the model conforms to that on the empirical probability distribution ~P(x,y). We select the probability distribution that satis es these constraints of all feature functions and maximizes its entropy, H(P) = summationtextx,y P(x,y) log (P(x,y)).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>