<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1628"> <Title>2Information and Communication Technologies</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Using The Viterbi Algorithm for Sentence Generation </SectionTitle>
<Paragraph position="0"> We assume that the reader is familiar with the Viterbi algorithm. The interested reader is referred to Manning and Schütze [1999] for a more complete description. Here, we summarise our re-implementation (described in [Wan et al., 2003]) of the Viterbi algorithm for summary sentence generation, as first introduced by Witbrock and Mittal [1999].</Paragraph>
<Paragraph position="1"> In this work, we begin with a Hidden Markov Model (HMM) in which the nodes (i.e., states) of the graph are uniquely labelled with words from a relevant vocabulary. To obtain a suitable subset of the vocabulary, words are taken from a set of related sentences, such as those that might occur in a news article (as is the case for the original work by Witbrock and Mittal). In this work, we use the clusters of event-related sentences from the Information Fusion work by Barzilay et al. [1999].</Paragraph>
<Paragraph position="2"> The edges between nodes in the HMM are typically weighted using bigram probabilities extracted from a related corpus.</Paragraph>
<Paragraph position="3"> The three probabilities of the unmodified Viterbi algorithm are defined as follows. Transition Probability (using the Maximum Likelihood Estimate to model bigram probabilities)1:</Paragraph>
<Paragraph position="4"> $p_{transition_{ngram}}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$</Paragraph>
<Paragraph position="5"> Emission Probability (for the purposes of testing the new transition probability function described in Section 4, this is set to 1 in this paper):</Paragraph>
<Paragraph position="6"> $p_{emission}(o_i \mid w_i) = 1$</Paragraph>
<Paragraph position="7"> The unmodified Viterbi algorithm as outlined here would generate word sequences using just a bigram model. As noted above, such sequences will often be ungrammatical.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A Mechanism for Propagating Dependency Features in the Extended Viterbi Algorithm </SectionTitle>
<Paragraph position="0"> In our extension, we modify the definition of the Transition Probability such that we consider not only bigram probabilities but also dependency-based transition probabilities. Examining the dependency head of the preceding string then allows us to consider long-distance context when appending a new word. The algorithm ranks highly those words with a plausible dependency relation to the preceding string, with respect to the source text being generated from (or summarised). However, instead of considering just the likelihood of a dependency relation between adjacent pairs of words, we consider the likelihood of a word attaching to the dependency tree structure of the partially generated sentence. Specifically, it is the rightmost root-to-leaf branch that can still be modified or governed by the appending of a new word to the string.</Paragraph>
<Paragraph position="1"> This rightmost branch is stored as a stack. It is updated and propagated to the end of the path each time we add a word.</Paragraph>
<Paragraph position="2"> Thus, our extension has two components: Dependency Transition and Head Stack Reduction. Aside from these modifications, the Viterbi algorithm remains the same.</Paragraph>
<Paragraph position="3"> In the remaining subsections, we describe in detail how the dependency relations are computed and how the stack is reduced. In Figure 3, we present pseudo-code for the extended Viterbi algorithm.</Paragraph>
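As an illustration of this extension (a minimal Python sketch under simplifying assumptions, not the pseudo-code of Figure 3), each Viterbi path below carries its own head stack, and the transition score averages a bigram estimate with a dependency score computed against that stack; the emission probability is fixed to 1 as above. The helpers bigram_prob, dep_transition and reduce_stack are assumed to implement the components described in Sections 4.1 and 4.2.

def extended_viterbi(vocab, start_word, max_length,
                     bigram_prob, dep_transition, reduce_stack):
    # Each surviving path is keyed by its final word and stores
    # (score, word sequence, propagated head stack).
    paths = {start_word: (1.0, [start_word], [start_word])}
    for _ in range(max_length - 1):
        new_paths = {}
        for word in vocab:
            best = None
            for prev, (score, seq, stack) in paths.items():
                # Transition: average of the bigram estimate and the
                # dependency score of `word` against the propagated stack.
                trans = 0.5 * (bigram_prob(prev, word) + dep_transition(word, stack))
                cand = score * trans  # emission probability is 1
                if best is None or cand > best[0]:
                    best = (cand, seq + [word], reduce_stack(stack, word))
            if best is not None:
                new_paths[word] = best
        paths = new_paths
    # Return the highest-scoring word sequence of the requested length.
    return max(paths.values(), key=lambda entry: entry[0])[1]

For instance, extended_viterbi(vocab, "<s>", 10, bigram_prob, dep_transition, reduce_stack) would return the best ten-word sequence under these scores.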
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Scoring a Dependency Transition </SectionTitle>
<Paragraph position="0"> The Dependency Transition is simply an additional weight on the HMM edge. The transition probability is the average of the two transition weights based on bigrams and dependencies:</Paragraph>
<Paragraph position="1"> $p_{transition}(w_i \mid w_{i-1}) = \frac{1}{2}\big(p_{transition_{ngram}}(w_i \mid w_{i-1}) + p_{transition_{dep}}(w_i \mid stack(w_{i-1}))\big)$</Paragraph>
Dependency Parse Preprocessing of Source Text
<Paragraph position="2"> Before we begin the generation process, we first use a dependency parser to parse all the sentences from the source text to obtain dependency trees. A traversal of each dependency tree yields all parent-child relationships, and we update an adjacency matrix of connectivity accordingly. Because the status of a word as a head or modifier depends on word order in English, we consider relative word positions to determine whether a relation has a forward or backward2 direction. Forward and backward directional relations are stored in separate matrices. The Forward matrix stores relations in which the head is to the right of the modifier in the sentence. Conversely, the Backward matrix stores relations in which the head is to the left of the modifier. This distinction is required later in the stack reduction step.</Paragraph>
<Paragraph position="3"> 1Here the subscripts refer to the fact that this is a transition probability based on n-grams. We will later propose an alternative using dependency transitions.</Paragraph>
<Paragraph position="4"> As an example, given the two strings (using characters in lieu of words) "d b e a c" and "b e d c a" and their corresponding dependency trees, we count each head-modifier pair into the appropriate matrix. We refer to the matrices as Adj_right and Adj_left respectively. The cell value in each matrix indicates the number of times word i (that is, the row index) governs word j (that is, the column index).</Paragraph>
Computing the Dependency Transition Probability
<Paragraph position="8"> We define the Dependency Transition weight as:</Paragraph>
<Paragraph position="9"> $p_{transition_{dep}}(w_i \mid stack(w_{i-1})) = p(Dep_{sym}(w_i, stack(w_{i-1})))$</Paragraph>
<Paragraph position="10"> where $Dep_{sym}$ is the symmetric relation stating that some dependency relation occurs between a word and any of the words in the stack, irrespective of which is the head. Intuitively, the stack is a compressed representation of the dependency tree corresponding to the preceding words. The probability indicates how likely it is that the new word can attach itself to this incrementally built dependency tree, either as a modifier or a governor. Since the stack is cumulatively passed on at each point, we need only consider the stack stored at the preceding word.</Paragraph>
<Paragraph position="11"> This is estimated as follows:</Paragraph>
<Paragraph position="12"> $p(Dep_{sym}(w_i, stack(w_{i-1}))) \approx \max_{s \in stack(w_{i-1})} p(Dep_{sym}(w_i, s))$</Paragraph>
<Paragraph position="13"> 2Cf. the forward/backward directionality of Combinatory Categorial Grammar [Steedman, 2000].</Paragraph>
<Paragraph position="14"> [Figure 1: Stack manipulation for the sentence "the quick brown fox jumps over the lazy dog ." Although the text is presented in the figure on two lines, it represents a single sequence of words. The stack (oriented upwards) grows and shrinks as we add words. Note that the modifiers to dog are popped off before it is pushed on. Note also that modifiers of existing items on the stack, such as over, are merely pushed on. Words with no connection to previously seen stack items are also pushed on (e.g. quick) in the hope that a head will be found later.]</Paragraph>
<Paragraph position="16"> Here, we assume that a word can only attach to the tree once, at a single node; hence, we find the node that maximises the probability of node attachment. The relationship $Dep_{sym}(a, b)$ is modelled using a simplified version of Collins' [1996] dependency model.</Paragraph>
<Paragraph position="17"> Because the status of a word as the head relies on the preservation of word order, we keep track of the directionality of a relation. For two words a and b, where a precedes b in the generated string:</Paragraph>
<Paragraph position="18"> $p(Dep_{sym}(a, b)) \propto Adj_{right}(b, a) + Adj_{left}(a, b)$</Paragraph>
<Paragraph position="19"> where $Adj_{right}$ and $Adj_{left}$ are the right and left adjacency matrices. Recall that row indices are heads and column indices are modifiers.</Paragraph> </Section>
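As a rough Python illustration of this subsection (a sketch under assumptions rather than the authors' implementation): dependency parses are assumed to arrive as (head, modifier, head_position, modifier_position) tuples, plain dictionaries stand in for the Adj_right and Adj_left matrices, and the normalisation of counts into probabilities, following Collins' model, is glossed over.

from collections import defaultdict

# Counts standing in for Adj_right / Adj_left: Adj_right counts relations
# whose head lies to the right of its modifier, Adj_left those whose head
# lies to the left. Keys are (head, modifier) pairs.
adj_right = defaultdict(int)
adj_left = defaultdict(int)

def add_parsed_sentence(relations):
    """relations: iterable of (head, modifier, head_pos, mod_pos) tuples."""
    for head, mod, head_pos, mod_pos in relations:
        if head_pos > mod_pos:
            adj_right[(head, mod)] += 1
        else:
            adj_left[(head, mod)] += 1

def dep_sym(a, b):
    """Unnormalised symmetric dependency count for a word a preceding b:
    either b governs a (head to the right) or a governs b (head to the left)."""
    return adj_right[(b, a)] + adj_left[(a, b)]

def dep_transition(word, stack):
    """Attachment score of `word` against the propagated head stack: the best
    symmetric dependency count over the stack items (the max in the estimate above)."""
    return max((dep_sym(item, word) for item in stack), default=0)

In practice these raw counts would be normalised so that they are comparable with the bigram probabilities.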
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Head Stack Reduction </SectionTitle>
<Paragraph position="0"> Once we decide that a newly considered path is better than any other previously considered one, we update the head stack to represent the extended path. At any point in time, the stack represents the rightmost root-to-leaf branch of the dependency tree (for the generated sentence) that can still be modified or governed by concatenating new words to the string.3 Within the stack, older words may be modified by newer words. Our rules for modifying the stack are designed to cater for a projective4 dependency grammar.</Paragraph>
<Paragraph position="1"> There are three possible outcomes of the reduction. The first is that the proposed top-of-stack (ToS) has no dependency relation to any of the existing stack items, in which case the stack remains unchanged. For the second and third cases, we check each item on the stack and keep a record only of the most probable dependency between the proposed ToS and the appropriate stack item. The second outcome, then, is that the proposed ToS is the head of some item on the stack. All items up to and including that stack item are popped off and the proposed ToS is pushed on. The third outcome is that it modifies some item on the stack. All stack items up to (but not including) that stack item are popped off and the proposed ToS is pushed on. The pseudocode is presented in Figure 2. An example of stack manipulation is presented in Figure 1. We rely on two external functions. The first function, depsym/2, has already been presented above.</Paragraph>
<Paragraph position="2"> The second function, isReduced/2, relies on an auxiliary function returning the probability of one word being governed by the other, given the relative order of the words. This is in essence our parsing step, determining which word governs the other. The function is defined as follows:</Paragraph>
<Paragraph position="3"> $p(isHeadRight(a, b)) = \frac{\mathrm{count}(hasRelation(a, b) \wedge i(head) > i(modifier))}{hasRelation(a, b)}$, and analogously $p(isHeadLeft(a, b))$ with $i(head) < i(modifier)$,</Paragraph>
<Paragraph position="4"> where hasRelation/2 is the number of times we see the two words in a dependency relation, and where i(w_i) returns a word's position in the corpus sentence. The function isReduced/2 makes calls to p(isHeadRight/2) and p(isHeadLeft/2). It returns true if the first parameter is the head of the second, and false otherwise. In the comparison, the denominator is constant. We thus need only the numerator in these auxiliary functions.</Paragraph>
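Continuing the sketch above (and reusing its adj_right, adj_left and dep_sym), the following is one plausible Python rendering of the reduction, not the Figure 2 pseudocode itself; following Figure 1, a word with no relation to any stack item is simply pushed on.

def is_reduced(a, b):
    """Stand-in for isReduced/2: True when a is taken to govern b, where b
    precedes a in the generated string; only the numerators are compared,
    since the shared denominator is constant."""
    return adj_right[(a, b)] >= adj_left[(b, a)]

def reduce_stack(stack, new_word):
    """Return the updated head stack after appending new_word."""
    # Find the stack item with the strongest symmetric dependency to new_word.
    best_idx, best_score = None, 0
    for idx in range(len(stack) - 1, -1, -1):  # top of stack is the end
        score = dep_sym(stack[idx], new_word)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx is None:
        # Outcome 1: no relation; push the word on, hoping a head appears later.
        return stack + [new_word]
    if is_reduced(new_word, stack[best_idx]):
        # Outcome 2: new_word governs that item; pop up to and including it.
        return stack[:best_idx] + [new_word]
    # Outcome 3: new_word modifies that item; pop only the items above it.
    return stack[:best_idx + 1] + [new_word]

The distance heuristic discussed next could be folded in by discounting the score of items deeper in the stack.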
<Paragraph position="5"> Collins' distance heuristics [1996] weight the probability of a dependency relation between two words based on the distance between them. We could implement a similar strategy by favouring small reductions in the head stack. Thus a reduction with a more recent stack item, which is closer to the proposed ToS, would be penalised less than one with an older item.</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Related Work </SectionTitle>
<Paragraph position="0"> There is a wealth of relevant research related to sentence generation. We focus here on a discussion of related work from statistical sentence generation and from summarisation.</Paragraph>
<Paragraph position="1"> In recent years, there has been a steady stream of research in statistical text generation. We focus here on work that generates sentences from some sentential semantic representation via a statistical method; for examples of related statistical sentence generators, see Langkilde and Knight [1998] and Bangalore and Rambow [2000]. These approaches begin with a representation of sentence semantics that closely resembles a dependency tree. This semantic representation is turned into a word lattice, and the system searches for the best path through the lattice: by ranking all traversals with an n-gram model, the best surface realisation of the semantic representation is chosen. Our approach differs in that we do not start with a semantic representation. Instead, we paraphrase the original text, searching for the best word sequence and dependency tree structure concurrently.</Paragraph>
<Paragraph position="2"> Research in summarisation has also addressed the problem of generating non-verbatim sentences; see [Jing and McKeown, 1999], [Barzilay et al., 1999] and, more recently, [Daumé III and Marcu, 2004]. Jing presented an HMM for learning alignments between summary and source sentences, trained using examples of summary sentences generated by humans. Daumé III also provides a mechanism for sub-sentential alignment but allows for alignments between multiple sentences. Both of these approaches provide models for later recombining sentence fragments. Our work differs primarily in granularity: using words as the basic unit potentially offers greater flexibility in pseudo-paraphrase generation, since we are able to modify the word sequence within the phrase.</Paragraph>
<Paragraph position="3"> It should be noted, however, that a successful execution of our algorithm is likely to conserve constituent structure (i.e., a coarser granularity) via the use of dependencies, whilst still making flexibility available at the word level. Additionally, our use of dependencies allows us to generate not only a string but also a dependency tree for that sentence.</Paragraph> </Section> </Paper>