<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2103"> <Title>Discourse Generation Using Utility-Trained Coherence Models</Title> <Section position="4" start_page="803" end_page="806" type="metho"> <SectionTitle> 2 Stochastic Models of Discourse </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="803" end_page="803" type="sub_section"> <SectionTitle> Coherence 2.1 Local Models of Discourse Coherence </SectionTitle> <Paragraph position="0"> Stochastic local models of coherence work under the assumption that well-formed discourse can be characterized in terms of specific distributions of local recurring patterns. These distributions can be defined at the lexical level or entity-based levels.</Paragraph> <Paragraph position="1"> Word-Coocurrence Coherence Models. We propose a new coherence model, inspired by (Knight, 2003), that models the intuition that the usage of certain words in a discourse unit (sentence) tends to trigger the usage of other words in subsequent discourse units. (A similar intuition holds for the Machine Translation models generically known as the IBM models (Brown et al., 1993), which assume that certain words in a source language sentence tend to trigger the usage of certain words in a target language translation of that sentence.) We train models able to recognize local recurring patterns of word usage across sentences in an unsupervised manner, by running an Expectation-Maximization (EM) procedure over pairs of consecutive sentences extracted from a large collection of training documents1. We expect EM to detect and assign higher probabilities to recurring word patterns compared to casually occurring word patterns.</Paragraph> <Paragraph position="2"> A local coherence model based on IBM Model 1 assigns the following probability to a text a0 consisting of a1 sentences a2a4a3a6a5a7a2a9a8a10a5a6a11a6a11a6a11a9a5a7a2a13a12 :</Paragraph> <Paragraph position="4"> We call the above equation the direct IBM Model 1, as this model considers the words in sen-</Paragraph> <Paragraph position="6"> events) as being generated by the words in sentence a2 a35 (the a2 a55a35 events, which include the special a2 a56a35 event called the NULL word), with probability a58 a24 a2</Paragraph> <Paragraph position="8"> This model considers the words in sentence a2 a35 (the</Paragraph> <Paragraph position="10"> events, which include the spe-</Paragraph> <Paragraph position="12"> Entity-based Coherence Models. Barzilay and Lapata (2005) recently proposed an entity-based coherence model that aims to learn abstract coherence properties, similar to those stipulated by Centering Theory (Grosz et al., 1995). Their model learns distribution patterns for transitions between discourse entities that are abstracted into their syntactic roles - subject (S), object (O), other (X), missing (-). The feature values are computed using an entity-grid representation for the discourse that records the syntactic role of each entity as it appears in each sentence. Also, salient entities are differentiated from casually occurring entities, based on the widely used assumption that occurrence frequency correlates with discourse prominence (Morris and Hirst, 1991; Grosz et al., 1995). 
<Paragraph position="4"> We exclude the coreference information from this model, as the discourse ordering problem cannot accommodate current coreference solutions, which assume a pre-specified order (Ng, 2005).</Paragraph> <Paragraph position="5"> In the jargon of (Barzilay and Lapata, 2005), the model we implemented is called Syntax+Salience.</Paragraph> <Paragraph position="6"> The probability assigned to a text T = s_1 ... s_n by this Entity-Based model (henceforth called EB) can be locally computed (i.e., at sentence-transition level) using m feature functions, as follows: P_{EB}(T) \propto \exp\big(\sum_{i=1}^{n-1} \sum_{k=1}^{m} w_k f_k(s_i, s_{i+1})\big), where the f_k(s_i, s_{i+1}) are entity-transition feature values and the w_k are weights trained to discriminate between coherent, human-authored documents and examples assumed to have lost some degree of coherence (scrambled versions of the original documents).</Paragraph> </Section> <Section position="2" start_page="803" end_page="804" type="sub_section"> <SectionTitle> 2.2 Global Models of Discourse Coherence </SectionTitle> <Paragraph position="0"> Barzilay and Lee (2004) propose a document content model that uses a Hidden Markov Model (HMM) to capture more global aspects of coherence. Each state in their HMM corresponds to a distinct &quot;topic&quot;. Topics are determined by an unsupervised algorithm via complete-link clustering, and are written as q_i, with q_i \in Q.</Paragraph> <Paragraph position="1"> The probability assigned to a text T = s_1 ... s_n by this Content Model (henceforth called CM) can be written as follows: P_{CM}(T) = \sum_{q_1 \ldots q_n \in Q^n} \prod_{i=1}^{n} p(q_i \mid q_{i-1})\, p(s_i \mid q_i). The first term, p(q_i \mid q_{i-1}), models the probability of changing from topic q_{i-1} to topic q_i. The second term, p(s_i \mid q_i), models the probability of generating sentence s_i from topic q_i.</Paragraph> </Section> <Section position="3" start_page="804" end_page="804" type="sub_section"> <SectionTitle> 2.3 Combining Local and Global Models of Discourse Coherence </SectionTitle> <Paragraph position="0"> We can model the probability P(T) of a text T using a log-linear model that combines the discourse coherence models presented above. In this framework, we have a set of m feature functions f_k(T), 1 \le k \le m. For each feature function, there exists a model parameter \lambda_k, 1 \le k \le m. The probability P(T) can be written under the log-linear model as follows: P(T) = \frac{\exp(\sum_{k=1}^{m} \lambda_k f_k(T))}{\sum_{T'} \exp(\sum_{k=1}^{m} \lambda_k f_k(T'))}.</Paragraph> <Paragraph position="1"> Under this model, finding the most probable text T is equivalent to solving Equation 1, and therefore we do not need to be concerned about computing expensive normalization factors: \hat{T} = \arg\max_{T} \sum_{k=1}^{m} \lambda_k f_k(T) (1).</Paragraph> <Paragraph position="2"> In this framework, we distinguish between the modeling problem, which amounts to finding appropriate feature functions for the discourse coherence task, and the training problem, which amounts to finding appropriate values for \lambda_k, 1 \le k \le m. We address the modeling problem by using as feature functions the discourse coherence models presented in the previous sections. In Section 3, we address the training problem by performing discriminative training of the \lambda_k parameters, using as utility function a metric that measures how different a training instance is from a given reference.</Paragraph>
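To illustrate how the quantity in Equation 1 is evaluated for a single candidate text, the following minimal sketch (hypothetical interfaces and toy values, not the trained models or weights from our experiments) combines the log-score of the direct IBM Model 1 with a stub feature under fixed lambda weights:

import math

def ibm1_direct_logscore(order, t_table, null_word="NULL"):
    """log P^D_IBM1 of an ordered list of tokenized sentences, given a lexical
    table t_table[(w, v)] = t(w|v) assumed to have been learned with EM."""
    logp = 0.0
    for prev, cur in zip(order, order[1:]):
        context = list(prev) + [null_word]
        for w in cur:
            # small floor for unseen pairs, so the log stays finite (an assumption)
            p = sum(t_table.get((w, v), 1e-9) for v in context)
            logp += math.log(p)
    return logp

def combined_score(order, feature_fns, lambdas):
    """Sum_k lambda_k * f_k(T): the quantity maximized in Equation 1."""
    return sum(lam * f(order) for lam, f in zip(lambdas, feature_fns))

# Hypothetical usage: the IBM-1 feature plus a stub standing in for CM or EB
t_table = {("damage", "earthquake"): 0.2, ("damage", "NULL"): 0.01}
features = [lambda T: ibm1_direct_logscore(T, t_table), lambda T: 0.0]
lambdas = [1.0, 0.5]
candidate = [["an", "earthquake", "struck"], ["the", "damage", "was", "severe"]]
print(combined_score(candidate, features, lambdas))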
<Paragraph position="3"> The algorithms we propose use as input representation the IDL-expressions formalism (Nederhof and Satta, 2004; Soricut and Marcu, 2005). We use here the IDL formalism (which stands for Interleave, Disjunction, Lock, after the names of its operators) to define finite sets of possible discourses over given discourse units. Without loss of generality, we consider sentences as discourse units in our examples and experiments.</Paragraph> </Section> <Section position="4" start_page="804" end_page="805" type="sub_section"> <SectionTitle> 3.1 Input Representation </SectionTitle> <Paragraph position="0"> Consider the discourse units A-C presented in Figure 1(a). Each of these units undergoes various processing stages in order to provide the information needed by our coherence models. The entity-based model (EB) (Section 2), for instance, makes use of a syntactic parser to determine the syntactic role played by each detected entity (Figure 1(b)). For example, the string SSXXXXOX- - - - - - - - - - - - (first row of the grid in Figure 1(b), corresponding to discourse unit A) encodes that the first two entities have subject (S) roles, several other entities have other (X) roles, one entity has object (O) role, and the rest of the entities do not appear (-) in this unit.</Paragraph> <Paragraph position="1"> In order to be able to solve Equation 1, the input representation needs to provide the necessary information to compute all f_k terms, that is, all individual model scores. Textual units A, B, and C in our example are therefore represented as terms a, b, and c, respectively2 (Figure 1(c)). (Footnote 2: Following Barzilay and Lee (2004), proper names, dates, and numbers are replaced with generic tokens.) These terms act like building blocks for IDL-expressions, as in the following example: E = &lt;s&gt; · (a ∥ b ∥ c) · &lt;/s&gt;. Expression E uses the ∥ (Interleave) operator to create a bag-of-units representation. That is, E stands for the set of all possible order permutations of a, b, and c, with the additional information that any of these orders is to appear between the beginning of document &lt;s&gt; and the end of document &lt;/s&gt;. An equivalent representation, called IDL-graphs, captures the same information using vertices and edges, which stand in a direct correspondence with the operators and atomic symbols of IDL-expressions. For instance, each n-pair of ⟨- and ⟩-labeled edges, together with their source and target vertices, respectively, corresponds to an n-argument ∥ operator. In Figure 2, we show the IDL-graph corresponding to IDL-expression E.</Paragraph> </Section> <Section position="5" start_page="805" end_page="806" type="sub_section"> <SectionTitle> 3.2 Search Algorithms </SectionTitle> <Paragraph position="0"> Algorithms that operate on IDL-graphs have been recently proposed by Soricut and Marcu (2005).</Paragraph> <Paragraph position="1"> We extend these algorithms to take as input IDL-graphs over non-atomic symbols (such that the coherence models can operate inside terms like a, b, and c from Figure 1), and also to work under models with hidden variables, such as CM (Section 2.2).</Paragraph> <Paragraph position="2"> These algorithms, called IDL-CH-A* (A* search for IDL-expressions under Coherence models) and IDL-CH-HB_b (Histogram-Based beam search for IDL-expressions under Coherence models, with histogram beam b), assume an alphabet \Sigma of non-atomic (visible) variables (over which the input IDL-expressions are defined) and an alphabet Q of hidden variables.
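Before describing how the search proceeds, a small sketch makes the bag-of-units semantics of Section 3.1 concrete (an illustration only; the actual algorithms unfold the IDL-graph lazily rather than enumerating the permutation set):

from itertools import permutations

def bag_of_units_orders(units, bos="<s>", eos="</s>"):
    """All sequences denoted by <s> . (u1 || u2 || ... || un) . </s>:
    every permutation of the units, wrapped by the document delimiters."""
    for perm in permutations(units):
        yield (bos,) + perm + (eos,)

# Example with the three discourse units A, B, C
for order in bag_of_units_orders(["A", "B", "C"]):
    print(" ".join(order))
# 3! = 6 orders; in general the set has n! members, which is why the search
# algorithms described next unfold the IDL-graph on the fly instead of listing it.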
These algorithms unfold an input IDL-graph on the fly, as follows: starting from the initial vertex, the input graph is traversed in an IDL-specific manner, by creating states that keep track of the n positions in any subgraph corresponding to an n-argument ∥ operator, as well as the last edge traversed and the last hidden variable considered. For instance, the state \sigma marked by the blackened vertices in Figure 2 records that expressions b and c have already been considered (while a is still in the future of state \sigma), and that c was the last one considered, evaluated under the hidden variable q_i. The information recorded in each state allows for the computation of a current coherence cost under any of the models described in Section 2. In what follows, we assume this model to be the model from Equation 1, since each of the individual models can be obtained by setting the other \lambda s to 0.</Paragraph> <Paragraph position="3"> We also define an admissible heuristic function (Russell and Norvig, 1995), which is used to compute an admissible future cost h(\sigma) for state \sigma, using the following equation: h(\sigma) = \sum_{w \in F(\sigma)} \min_{q \in Q,\ \langle v, q' \rangle \in C(\sigma) \times Q} cost(\langle w, q \rangle \mid \langle v, q' \rangle). Here F(\sigma) is the set of future (visible) events for state \sigma, which can be computed directly from an input IDL-graph as the set of all \Sigma-edge-labels between the vertices of state \sigma and the final vertex. For example, for the state \sigma considered above, F(\sigma) = {a, &lt;/s&gt;}. C(\sigma) is the set of future conditioning events for state \sigma, which can be obtained from F(\sigma) (any non-final future event may become a future conditioning event) by eliminating &lt;/s&gt; and adding the current conditioning event of \sigma; for the considered example state \sigma, we have C(\sigma) = {a, c}. The value h(\sigma) is admissible because, for each future event \langle w, q \rangle, with w \in F(\sigma) and q \in Q, its cost is computed using the most inexpensive conditioning event \langle v, q' \rangle \in C(\sigma) \times Q. The IDL-CH-A* algorithm uses a priority queue (sorted according to total cost, computed as current + admissible) to control the unfolding of an input IDL-graph, by processing, at each unfolding step, the most inexpensive state (extracted from the top of the queue). The admissibility of the future costs and the monotonicity property enforced by the priority queue guarantee that IDL-CH-A* finds an optimal solution to Equation 1 (Russell and Norvig, 1995).</Paragraph> <Paragraph position="4"> The IDL-CH-HB_b algorithm uses a histogram beam b to control the unfolding of an input IDL-graph, by processing, at each unfolding step, the top b most inexpensive states (according to total cost). This algorithm can be tuned (via b) to achieve a good trade-off between speed and accuracy.
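A deliberately simplified sketch of this search strategy follows (permutations of atomic units only, a generic non-negative transition cost standing in for the full coherence models, and a cheapest-incoming-transition estimate standing in for the admissible heuristic h; it is not the implementation used in our experiments and omits the IDL-graph machinery and hidden variables):

import heapq
from itertools import count

def astar_order(units, trans_cost, beam=None):
    """Order `units` to minimize the total trans_cost(prev, next) along the sequence.
    A state is (set of placed units, last unit placed, partial order). The estimate h
    charges each unplaced unit its cheapest possible incoming transition, which
    under-estimates the true remaining cost when costs are non-negative."""
    tick = count()  # tie-breaker so heap entries never compare states directly

    def h(used, last):
        total = 0.0
        for u in units:
            if u in used:
                continue
            cands = [v for v in units if v != u and v not in used]
            if last is not None:
                cands.append(last)
            total += min((trans_cost(v, u) for v in cands), default=0.0)
        return total

    frontier = [(h(frozenset(), None), 0.0, next(tick), frozenset(), None, ())]
    while frontier:
        if beam is not None:                  # histogram-beam variant (IDL-CH-HB_b style)
            frontier = heapq.nsmallest(beam, frontier)
            heapq.heapify(frontier)
        _, g, _, used, last, seq = heapq.heappop(frontier)
        if len(used) == len(units):
            return seq, g                     # with beam=None this is A*, hence optimal
        for u in units:
            if u in used:
                continue
            g2 = g + trans_cost(last, u)
            used2 = used | {u}
            heapq.heappush(frontier, (g2 + h(used2, u), g2, next(tick), used2, u, seq + (u,)))

# Hypothetical transition costs favoring the order A, B, C
costs = {(None, "A"): 0.0, ("A", "B"): 1.0, ("B", "C"): 1.0}
print(astar_order(["A", "B", "C"], lambda p, n: costs.get((p, n), 5.0)))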
We refer the reader to (Soricut, 2006) for additional details regarding the optimality and the theoretical run-time behavior of these algorithms.</Paragraph> </Section> <Section position="6" start_page="806" end_page="806" type="sub_section"> <SectionTitle> 3.3 Utility-based Training </SectionTitle> <Paragraph position="0"> In addition to the modeling problem, we must also address the training problem, which amounts to finding appropriate values for the \lambda_k parameters from Equation 1.</Paragraph> <Paragraph position="1"> The solution we employ here is the discriminative training procedure of Och (2003). This procedure learns an optimal setting of the \lambda_k parameters using as optimality criterion the utility of the proposed solution. There are two necessary ingredients for implementing Och's (2003) training procedure. First, it needs a search algorithm that is able to produce ranked N-best lists of the most promising candidates in a reasonably fast manner (Huang and Chiang, 2005). We accommodate N-best computation within the IDL-CH-HB_100 algorithm, which decodes bag-of-units IDL-expressions at an average speed of 75.4 sec./exp. on a 3.0 GHz CPU Linux machine, for an average input of 11.5 units per expression.</Paragraph> <Paragraph position="2"> Second, it needs a criterion which can automatically assess the quality of the proposed candidates. To this end, we employ two different metrics, such that we can measure the impact of using different utility functions on performance.</Paragraph> <Paragraph position="3"> TAU (Kendall's τ). One of the most frequently used metrics for the automatic evaluation of document coherence is Kendall's τ (Lapata, 2003; Barzilay and Lee, 2004). TAU measures the minimum number of adjacent transpositions needed to transform a proposed order into the reference order, normalized such that the range of the metric is between -1 (the worst) and 1 (the best).</Paragraph> <Paragraph position="4"> BLEU. One of the most successful metrics for judging machine-generated text is BLEU (Papineni et al., 2002). It counts the number of unigram, bigram, trigram, and four-gram matches between hypothesis and reference, and combines them using a geometric mean. For the discourse ordering problem, we represent hypotheses and references as index sequences (e.g., &quot;4 2 3 1&quot; is a hypothesis order over four discourse units, in which the first and last units have been swapped with respect to the reference order). The range of BLEU scores is between 0 (the worst) and 1 (the best).</Paragraph> <Paragraph position="5"> We run different discriminative training sessions using TAU and BLEU as utility functions, and train two different sets of \lambda_k parameters for Equation 1. The log-linear models thus obtained are called Log-linear^TAU and Log-linear^BLEU, respectively.</Paragraph> </Section> </Section>
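For concreteness, a minimal sketch of the two utility metrics computed over index sequences (an illustration, not the exact evaluation code; the BLEU variant is plain sentence-level BLEU with brevity penalty and no smoothing):

import math
from collections import Counter

def kendall_tau(hyp, ref):
    """Kendall's tau between two orderings of the same items:
    1 - 2 * (discordant pairs) / (n choose 2); -1 (reversed) to 1 (identical)."""
    pos = {item: i for i, item in enumerate(ref)}
    h = [pos[item] for item in hyp]
    n = len(h)
    discordant = sum(1 for i in range(n) for j in range(i + 1, n) if h[i] > h[j])
    return 1.0 - 4.0 * discordant / (n * (n - 1))

def bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU over index sequences: geometric mean of 1- to 4-gram
    precisions, times the brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
        precisions.append(overlap / max(sum(h_ngrams.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp, ref = [4, 2, 3, 1], [1, 2, 3, 4]   # first and last units swapped
print(kendall_tau(hyp, ref), bleu(hyp, ref))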
<Section position="5" start_page="806" end_page="808" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We evaluate empirically two different aspects of our work. First, we measure the performance of our search algorithms across different models.</Paragraph> <Paragraph position="1"> Second, we compare the performance of each individual coherence model, and also the performance of the discriminatively trained log-linear models.</Paragraph> <Paragraph position="2"> We also compare the overall performance (model & decoding strategy) obtained in our framework with previously reported results.</Paragraph> <Section position="1" start_page="806" end_page="807" type="sub_section"> <SectionTitle> 4.1 Evaluation setting </SectionTitle> <Paragraph position="0"> The task on which we conduct our evaluation is information ordering (Lapata, 2003; Barzilay and Lee, 2004; Barzilay and Lapata, 2005). In this task, a pre-selected set of information-bearing document units (in our case, sentences) needs to be arranged in a sequence which maximizes some specific information quality (in our case, document coherence). We use the information-ordering task as a means to measure the performance of our algorithms and models in a well-controlled setting.</Paragraph> <Paragraph position="1"> As described in Section 3, our framework can be used in applications such as multi-document summarization. In fact, Barzilay et al. (2002) formulate the multi-document summarization problem as an information-ordering problem, and show that naive ordering algorithms such as majority ordering (select the most frequent orders across input documents) and chronological ordering (order facts according to publication date) do not always yield coherent summaries.</Paragraph> <Paragraph position="2"> Data. For training and testing, we use documents from two different genres: newspaper articles and accident reports written by government officials (Barzilay and Lapata, 2005). The first collection (henceforth called EARTHQUAKES) contains newspaper articles about earthquakes; the second collection (henceforth called ACCIDENTS) contains aviation accident reports from the National Transportation Safety Board's database. For both collections, we used 100 documents for training and 100 documents for testing. A fraction of 40% of the training documents was temporarily removed and used as a development set, on which we performed the discriminative training procedure. [Table caption (fragment): "... document coherence, for both the EARTHQUAKES and ACCIDENTS genres, using IDL-CH-HB_100."]</Paragraph> </Section> <Section position="2" start_page="807" end_page="808" type="sub_section"> <SectionTitle> 4.2 Evaluation of Search Algorithms </SectionTitle> <Paragraph position="0"> We evaluated the performance of several search algorithms across four stochastic models of document coherence: the direct and inverse IBM Model 1 coherence models, the content model of Barzilay and Lee (2004) (CM), and the entity-based model of Barzilay and Lapata (2005) (EB) (Section 2). We measure search performance using an Estimated Search Error (ESE) figure, which reports the percentage of times when the search algorithm proposes a sentence order which scores lower than the original sentence order (OSO). We also measure the quality of the proposed documents using TAU and BLEU, using as reference the OSO. [Table caption (fragment): "... (affected by both model & search procedure) of our framework with previous results."]</Paragraph> <Paragraph position="1"> In Table 1, we report the performance of four search algorithms. The first three, IDL-CH-A*, IDL-CH-HB_100, and IDL-CH-HB_1, are the IDL-based search algorithms of Section 3, implementing A* search, histogram beam search with a beam of 100, and histogram beam search with a beam of 1, respectively.
We compare our algorithms against the greedy algorithm used by Lapata (2003). We note here that the comparison is rendered meaningful by the observation that this algorithm performs search identically to algorithm IDL-CH-HB_1 (histogram beam 1) when the heuristic function for future costs h is set to the constant 0.</Paragraph> <Paragraph position="2"> The results in Table 1 clearly show the superiority of the IDL-CH-A* and IDL-CH-HB_100 algorithms. Across all models considered, they consistently propose documents with scores at least as good as OSO (0% Estimated Search Error). As the original documents were coherent, it follows that the proposed document realizations also exhibit coherence. In contrast, the greedy algorithm of Lapata (2003) makes grave search errors. As the comparison between IDL-CH-HB_100 and IDL-CH-HB_1 shows, the superiority of the IDL-CH algorithms depends more on the admissible heuristic function h than on the ability to maintain multiple hypotheses while searching.</Paragraph> </Section> <Section position="3" start_page="808" end_page="808" type="sub_section"> <SectionTitle> 4.3 Evaluation of Log-linear Models </SectionTitle> <Paragraph position="0"> For this round of experiments, we held the search procedure constant (IDL-CH-HB_100) and varied the \lambda_k parameters of Equation 1. The utility-trained log-linear models are compared here against a baseline log-linear model for which all \lambda_k parameters are set to 1, and also against the individual models. The results are presented in Table 2.</Paragraph> <Paragraph position="1"> If not properly weighted, the log-linear combination may yield poorer results than those of individual models (an average TAU of .34 for the baseline log-linear model, versus .38 for the IBM model and .39 for CM, on the EARTHQUAKES domain). The highest TAU accuracy is obtained when using TAU to perform utility-based training of the \lambda_k parameters (.47 for EARTHQUAKES, .50 for ACCIDENTS). The highest BLEU accuracy is obtained when using BLEU to perform utility-based training of the \lambda_k parameters (.16 for EARTHQUAKES, .24 for ACCIDENTS). For both genres, the differences between the highest accuracy figures (in bold) and the accuracy of the individual models are statistically significant at 95% confidence (using bootstrap resampling).</Paragraph> </Section> <Section position="4" start_page="808" end_page="808" type="sub_section"> <SectionTitle> 4.4 Overall Performance Evaluation </SectionTitle> <Paragraph position="0"> The last comparison we provide is between the performance obtained in our framework and previously reported performance results (Table 3). We are able to provide this comparison based on the TAU figures reported in (Barzilay and Lee, 2004). The training and test data for both genres are the same, and therefore the figures can be directly compared. These figures account for combined model and search performance.</Paragraph> <Paragraph position="1"> We first note that, unfortunately, we failed to accurately reproduce the model of Barzilay and Lee (2004). Our reproduction has an average TAU figure of only .39, versus the original figure of .81 for EARTHQUAKES, and .36 versus .44 for ACCIDENTS. On the other hand, we reproduced successfully the model of Barzilay and Lapata (2005); the average TAU figure is .19 for EARTHQUAKES and .12 for ACCIDENTS3.
The large difference on the EARTHQUAKES corpus between the performance of Barzilay and Lee (2004) and our reproduction of their model is responsible for the overall lower performance (0.47) of our Log-linear^TAU model and IDL-CH-HB_100 search algorithm, which is nevertheless higher than that of its component model CM (0.39). On the other hand, we achieve the highest accuracy figure (0.50) on the ACCIDENTS corpus, outperforming the previously highest figure (0.44) of Barzilay and Lee (2004). These results empirically show that utility-trained log-linear models of discourse coherence outperform each of the individual coherence models considered.</Paragraph> </Section> </Section> </Paper>