<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1051">
  <Title>Computing Locally Coherent Discourses</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Discourse Ordering Problem
</SectionTitle>
    <Paragraph position="0"> We will first give a formal definition of the problem of computing locally coherent discourses, and demonstrate how some local coherence measures from the literature fit into this framework.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Definitions
</SectionTitle>
      <Paragraph position="0"> We assume that a discourse is made up of discourse units (depending on the underlying theory, these could be utterances, sentences, clauses, etc.), which must be ordered to achieve maximum local coherence. We call the problem of computing the optimal ordering the discourse ordering problem.</Paragraph>
      <Paragraph position="1"> We formalise the problem by assigning a cost to each unit-to-unit transition, and a cost for the discourse to start with a certain unit. Transition costs may depend on the local context, i.e. a fixed number of discourse units to the left may influence the cost of a transition. The optimal ordering is the one which minimises the sum of the costs.</Paragraph>
      <Paragraph position="2"> Definition 1. A d-place transition cost function for a set U of discourse units is a function cT : Ud a82. Intuitively, cT(un|u1,...,ud[?]1) is the cost of the transition (ud[?]1,ud) given that the immediately preceding units were u1,...,ud[?]2.</Paragraph>
      <Paragraph position="3"> A d-place initial cost function for U is a function cI : Ud - a82. Intuitively, cI(u1,...,ud) is the cost for the fact that the discourse starts with the sequence u1,...,ud.</Paragraph>
      <Paragraph position="4"> The d-place discourse ordering problem is defined as follows: Given a set U = {u1,...,un}, a d-place transition cost function cT and a (d [?] 1)place initial cost function cI, compute a permutation</Paragraph>
      <Paragraph position="6"> is minimal.</Paragraph>
      <Paragraph position="7"> The notation for the cost functions is suggestive: The transition cost function has the character of a conditional probability, which specifies that the cost of continuing the discourse with the unit ud depends on the local context u1,...,ud[?]1. This local context is not available for the first d [?] 1 units of the discourse, which is why their costs are summarily covered by the initial function.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Centering-Based Cost Functions
</SectionTitle>
      <Paragraph position="0"> One popular class of coherence measures is based on Centering Theory (CT, (Walker et al., 1998)). We will briefly sketch its basic notions and then show how some CT-based coherence measures can be cast into our framework.</Paragraph>
      <Paragraph position="1"> The standard formulation of CT e.g. in (Walker et al., 1998), calls the discourse units utterances, and assigns to each utterance ui in the discourse a list Cf(ui) of forward-looking centres. The members of Cf(ui) correspond to the referents of the NPs in ui and are ranked in order of prominence, the first element being the preferred centre Cp(ui). The backward-looking centre Cb(ui) of ui is defined as the highest ranked element of Cf(ui) which also appears in Cf(ui[?]1), and serves as the link between the two subsequent utterances ui[?]1 and ui. Each utterance has at most one Cb. If ui and ui[?]1 have no forward-looking centres in common, or if ui is the first utterance in the discourse, then ui does not have a Cb at all.</Paragraph>
      <Paragraph position="2"> Based on these concepts, CT classifies the transitions between subsequent utterances into different types. Table 1 shows the most common classification into the four types CONTINUE, RETAIN, SMOOTH-SHIFT, and ROUGH-SHIFT, which are predicted to be less and less coherent in this order (Brennan et al., 1987). Kibble and Power (2000) define three further classes of transitions: COHERENCE and SALIENCE, which are both defined in Table 1 as well, and NOCB, the class of transitions for which Cb(ui) is undefined. Finally, a transition is considered to satisfy the CHEAPNESS constraint (Strube and Hahn, 1999) if Cb(ui) = Cp(ui[?]1).</Paragraph>
      <Paragraph position="3"> Table 2 summarises some cost functions from the literature, in the reconstruction of Karamanis et al.</Paragraph>
      <Paragraph position="4"> (2004). Each line shows the name of the coherence measure, the arity d from Definition 1, and the initial and transition cost functions. To fit the definitions in one line, we use terms of the form fk, which abbreviate applications of f to the last k arguments of the cost functions, i.e. f(ud[?]k+1,...,ud).</Paragraph>
      <Paragraph position="5"> The most basic coherence measure, M.NOCB (Karamanis and Manurung, 2002), simply assigns to each NOCB transition the cost 1 and to every other transition the cost 0. The definition of cT(u2|u1), which decodes to nocb(u1,u2), only looks at the two units in the transition, and no further context.</Paragraph>
      <Paragraph position="6"> The initial costs for this coherence measure are always zero.</Paragraph>
      <Paragraph position="7"> The measure M.KP (Kibble and Power, 2000) sums the value of nocb and the values of three functions which evaluate to 0 if the transition is cheap, salient, or coherent, and 1 otherwise. This is an instance of the 3-place discourse ordering problem because COHERENCE depends on Cb(ui[?]1), which itself depends on Cf(ui[?]2); hence nocoh must take COHERENCE: COHERENCE[?]:</Paragraph>
      <Paragraph position="9"> SALIENCE: Cb(ui) = Cp(ui) CONTINUE SMOOTH-SHIFT SALIENCE[?]: Cb(ui) negationslash= Cp(ui) RETAIN ROUGH-SHIFT Table 1: COHERENCE, SALIENCE and the table of standard transitions d initial cost cI(u1,...,ud[?]1) transition cost cT(ud|u1,...,ud[?]1)</Paragraph>
      <Paragraph position="11"> three arguments.</Paragraph>
      <Paragraph position="12"> Finally, the measure M.BFP (Brennan et al., 1987) uses a lexicographic ordering on 4-tuples which indicate whether the transition is a CON-TINUE, RETAIN, SMOOTH-SHIFT, or ROUGH-SHIFT. cT and all four functions it is computed from take three arguments because the classification depends on COHERENCE. As the first transition in the discourse is coherent by default (it has no Cb), we can compute cI by distinguishing RETAIN and CONTINUE via SALIENCE. The tuple-valued cost functions can be converted to real-valued functions by choosing a sufficiently large number M and using the value M3 *cont + M2 *ret + M *ss + rs.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Probability-Based Cost Functions
</SectionTitle>
      <Paragraph position="0"> A fundamentally different approach to measure discourse coherence was proposed by Lapata (2003).</Paragraph>
      <Paragraph position="1"> It uses a statistical bigram model that assigns each pair ui,uk of utterances a probability P(uk|ui) of appearing in subsequent positions, and each utterance a probability P(ui) of appearing in the initial position of the discourse. The probabilities are estimated on the grounds of syntactic features of the discourse units. The probability of the entire discourse u1 ...un is the product P(u1) * P(u2|u1) * ...*P(un|un[?]1).</Paragraph>
      <Paragraph position="2"> We can transform Lapata's model straightforwardly into our cost function framework, as shown under M.LAPATA in Table 2. The discourse that minimizes the sum of the negative logarithms will also maximise the product of the probabilities. We have d = 2 because it is a bigram model in which the transition probability does not depend on the previous discourse units.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Equivalence of Discourse Ordering and
TSP
</SectionTitle>
    <Paragraph position="0"> Now we show that discourse ordering and the travelling salesman problem are equivalent. In order to do this, we first redefine discourse ordering as a graph problem.</Paragraph>
    <Paragraph position="1"> d-place discourse ordering problem (dPDOP): Given a directed graph G = (V,E), a node s [?] V and a function c : V d - a82, compute a simple directed path P = (s = v0,v1,...,vn) from s through all vertices in V which minimises summationtextn[?]d+1i=0 c(vi,vi+1,...,vi+d[?]1). We write instances of dPDOP as (V,E,s,c).</Paragraph>
    <Paragraph position="2"> The nodes v1,...,vn correspond to the discourse units. The cost function c encodes both the initial and the transition cost functions from Section 2 by returning the initial cost if its first argument is the (new) start node s.</Paragraph>
    <Paragraph position="3"> Now let's define the version of the travelling salesman problem we will use below.</Paragraph>
    <Paragraph position="4"> Generalised asymmetric TSP (GATSP): Given a directed graph G = (V,E), edge weights c : E - a82, and a partition (V1,...,Vk) of the nodes V , compute the shortest directed cycle that visits exactly one node of each Vi. We call such a cycle a tour and write instances of GATSP as ((V1,...,Vk),E,c).</Paragraph>
    <Paragraph position="5"> The usual definition of the TSP, in which every node must be visited exactly once, is the special case of GATSP where each Vi contains exactly one node. We call this case asymmetric travelling salesman problem, ATSP.</Paragraph>
    <Paragraph position="6">  We will show that ATSP can be reduced to 2PDOP, and that any dPDOP can be reduced to GATSP.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Reduction of ATSP to 2PDOP
</SectionTitle>
      <Paragraph position="0"> First, we introduce the reduction of ATSP to 2PDOP, which establishes NP-completeness of dPDOP for all d &gt; 1. The reduction is approximation preserving, i.e. if we can find a solution of 2PDOP that is worse than the optimum only by a factor of epsilon1 (an epsilon1-approximation), it translates to a solution of ATSP that is also an epsilon1-approximation.</Paragraph>
      <Paragraph position="1"> Since it is known that there can be no polynomial algorithms that compute epsilon1-approximations for general ATSP, for any epsilon1 (Cormen et al., 1990), this means that dPDOP cannot be approximated either (unless P=NP): Any polynomial algorithm for dPDOP will compute arbitrarily bad solutions on certain inputs.</Paragraph>
      <Paragraph position="2"> The reduction works as follows. Let G = ((V1,...,Vk),E,c) be an instance of ATSP, and</Paragraph>
      <Paragraph position="4"> v [?] V and split it into two nodes vs and vt. We assign all edges with source node v to vs and all edges with target node v to vt (compare Figure 1). Finally we make vs the source node of our 2PDOP instance Gprime.</Paragraph>
      <Paragraph position="5"> For every tour in G, we have a path in Gprime starting at vs visiting all other nodes (and ending in vt) with the same cost by replacing the edge (v,u) out of v by (vs,u) and the edge (w,v) into v by (w,vt).</Paragraph>
      <Paragraph position="6"> Conversely, for every path starting at vs visiting all nodes, we have an ATSP tour of the same cost, since all such paths will end in vt (as vt has no outgoing edges).</Paragraph>
      <Paragraph position="7"> An example is shown in Fig. 1. The ATSP instance on the left has the tour (1,3,2,1), indicated by the solid edges. The node 1 is split into the two nodes 1s and 1t, and the tour translates to the path (1s,3,2,1t) in the 2PDOP instance.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Reduction of dPDOP to GATSP
</SectionTitle>
      <Paragraph position="0"> Conversely, we can encode an instance G = (V,E,s,c) of dPDOP as an instance Gprime =</Paragraph>
      <Paragraph position="2"> the source node [s,s] are not drawn.</Paragraph>
      <Paragraph position="3"> ((V primeu)u[?]V ,Eprime,cprime) of GATSP, in such a way that the optimal solutions correspond. The cost of traversing an edge in dPDOP depends on the previous d [?] 1 nodes; we compress these costs into ordinary costs of single edges in the reduction to GATSP.</Paragraph>
      <Paragraph position="4"> The GATSP instance has a node [u1,...,ud[?]1] for every d [?] 1-tuple of nodes of V . It has an edge from [u1,...,ud[?]1] to [u2,...,ud[?]1,ud] iff there is an edge from ud[?]1 to ud in G, and it has an edge from each node into [s,...,s]. The idea is to encode a path P = (s = u0,u1,...,un) in G as a tour TP in Gprime that successively visits the nodes [ui[?]d+1,...ui], i = 0,...n, where we assume that uj = s for all j [?] 0 (compare Figure 2).</Paragraph>
      <Paragraph position="5"> The cost of TP can be made equal to the cost of P by making the cost of the edge from [u1,...,ud[?]1] to [u2,...,ud] equal to c(u1,...ud). (We set cprime(e) to 0 for all edges e between nodes with first component s and for the edges e with target node [sd[?]1].) Finally, we define V primeu to be the set of all nodes in Gprime with last component u. It is not hard to see that for any simple path of length n in G, we find a tour TP in Gprime with the same cost. Conversely, we can find for every tour in Gprime a simple path of length n in G with the same cost.</Paragraph>
      <Paragraph position="6"> Note that the encoding Gprime will contain many unnecessary nodes and edges. For instance, all nodes that have no incoming edges can never be used in a tour, and can be deleted. We can safely delete such unnecessary nodes in a post-processing step.</Paragraph>
      <Paragraph position="7"> An example is shown in Fig. 2. The 3PDOP instance on the left has a path (s,3,1,2), which translates to the path ([s,s],[s,3],[3,1],[1,2]) in the GATSP instance shown on the right. This path can be completed by a tour by adding the edge ([1,2],[s,s]), of cost 0. The tour indeed visits each V primeu (i.e., each column) exactly once. Nodes with last component s which are not [s,s] are unreachable and are not shown.</Paragraph>
      <Paragraph position="8"> For the special case of d = 2, the GATSP is simply an ordinary ATSP. The graphs of both problems look identical in this case, except that the GATSP instance has edges of cost 0 from any node to the source [s].</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Computing Optimal Orderings
</SectionTitle>
    <Paragraph position="0"> The equivalence of dPDOP and GATSP implies that we can now bring algorithms from the vast literature on TSP to bear on the discourse ordering problem. One straightforward method is to reduce the GATSP further to ATSP (Noon and Bean, 1993); for the case d = 2, nothing has to be done. Then one can solve the reduced ATSP instance; see (Fischetti et al., 2001; Fischetti et al., 2002) for a recent survey of exact methods.</Paragraph>
    <Paragraph position="1"> We choose the alternative of developing a new algorithm for solving GATSP directly, which uses standard techniques from combinatorial optimisation, gives us a better handle on optimising the algorithm for our problem instances, and runs more efficiently in practice. Our algorithm translates the GATSP instance into an integer linear program (ILP) and uses the branch-and-cut method (Nemhauser and Wolsey, 1988) to solve it. Integer linear programs consist of a set of linear equations and inequalities, and are solved by integer variable assignments which maximise or minimise a goal function while satisfying the other conditions.</Paragraph>
    <Paragraph position="2"> Let G = (V,E) be a directed graph and S [?] V .</Paragraph>
    <Paragraph position="3"> We define d+(S) = {(u,v) [?] E  |u [?] S and v negationslash[?] S} and d[?](S) = {(u,v) [?] E  |u /[?] S and v [?] S}, i.e. d+(S) and d[?](S) are the sets of all incoming and outgoing edges of S, respectively. We assume that the graph G has no edges within one partition Vu, since such edges cannot be used by any solution.</Paragraph>
    <Paragraph position="4"> With this assumption, GATSP can be phrased as an ILP as follows (this formulation is similar to the one proposed by Laporte et al. (1987)):</Paragraph>
    <Paragraph position="6"> We have a binary variable xe for each edge e of the graph. The intention is that xe has value 1 if e is used in the tour, and 0 otherwise. Thus the cost of the tour can be written assummationtexte[?]E cexe. The three conditions enforce the variable assignment to encode a valid GATSP tour. (1) ensures that all integer solutions encode a set of cycles. (2) guarantees that every partition Vi is visited by exactly one cycle. The inequalities (3) say that every subset of the partitions has an outgoing edge; this makes sure a solution encodes one cycle, rather than a set of multiple cycles.</Paragraph>
    <Paragraph position="7"> To solve such an ILP using the branch-and-cut method, we drop the integrality constraints (i.e. we replace xe [?] {0,1} by 0 [?] xe [?] 1) and solve the corresponding linear programming (LP) relaxation. If the solution of the LP is integral, we found the optimal solution. Otherwise we pick a variable with a fractional value and split the problem into two subproblems by setting the variable to 0 and 1, respectively. We solve the subproblems recursively and disregard a subproblem if its LP bound is worse than the best known solution.</Paragraph>
    <Paragraph position="8"> Since our ILP contains an exponential number of inequalities of type (3), solving the complete LPs directly would be too expensive. Instead, we start with a small subset of these inequalities, and test (efficiently) whether a solution of the smaller LP violates an inequality which is not in the current LP. If so, we add the inequality to the LP, resolve it, and iterate. Otherwise we found the solution of the LP with the exponential number of inequalities.</Paragraph>
    <Paragraph position="9"> The inequalities we add by need are called cutting planes; algorithms that find violated cutting planes are called separation algorithms.</Paragraph>
    <Paragraph position="10"> To keep the size of the branch-and-cut tree small, our algorithm employs some heuristics to find further upper bounds. In addition, we improve lower bound from the LP relaxations by adding further inequalities to the LP that are valid for all integral solutions, but can be violated for optimal solutions of the LP. One major challenge here was to find separation algorithms for these inequalities. We cannot go into these details for lack of space, but will discuss them in a separate paper.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We implemented the algorithm and ran it on some examples to evaluate its practical efficiency. The runtimes are shown in Tables 3 and 4 for an implementation using a branch-and-cut ILP solver which is free for all academic purposes (ILP-FS) and a commercial branch-and-cut ILP solver (ILP-CS).</Paragraph>
    <Paragraph position="1"> Our implementations are based on LEDA 4.4.1  for implementing the ILP-based branch-and-cut algorithm. SCIL can be used with different branch-and-cut core codes. We used CPLEX  Soplex/) as the free implementation. Note that all our implementations are still preliminary. The software is publicly available (www.mpi-sb.</Paragraph>
    <Paragraph position="2"> mpg.de/~althaus/PDOP.html).</Paragraph>
    <Paragraph position="3"> We evaluate the implementations on three classes of inputs. First, we use two discourses from the GNOME corpus, taken from (Karamanis, 2003), together with the centering-based cost functions from Section 2: coffers1, containing 10 discourse units, and cabinet1, containing 15 discourse units. Second, we use twelve discourses from the BLLIP corpus taken from (Lapata, 2003), together with M.LAPATA. These discourses are 4 to 13 discourse units long; the table only shows the instance with the highest running time. Finally, we generate random instances of 2PDOP of size 20-100, and of 3PDOP of size 10, 15, and 20. A random instance is the complete graph, where c(u1,...,ud) is chosen uniformly at random from {0,...,999}.</Paragraph>
    <Paragraph position="4"> The results for the 2-place instances are shown in Table 3, and the results for the 3-place instances are shown in Table 4. The numbers are runtimes in seconds on a Pentium 4 (Xeon) processor with 3.06 GHz. Note that a hypothetical baseline implementation which naively generates and evaluates all permutations would run over 77 years for a discourse of length 20, even on a highly optimistic platform that evaluates one billion permutations per second. For d = 2, all real-life instances and all random instances of size up to 50 can be solved in less than one second, with either implementation. The problem becomes more challenging for d = 3. Here the algorithm quickly establishes good LP bounds for  the real-life instances, and thus the branch-and-cut trees remain small. The LP bounds for the random instances are worse, in particular when the number of units gets larger. In this case, the further optimisations in the commercial software make a big difference in the size of the branch-and-cut tree and thus in the solution time.</Paragraph>
    <Paragraph position="5"> An example output for cabinet1 with M.NOCB is shown in Fig. 3; we have modified referring expressions to make the text more readable, and have marked discourse unit boundaries with &amp;quot;/&amp;quot; and expressions that establish local coherence with square brackets. This is one of many possible optimal solutions, which have cost 2 because of the two NOCB transitions at the very start of the discourse. Details on the comparison of different centering-based coherence measures are discussed by Karamanis et al. (2004).</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Comparison to Other Approaches
</SectionTitle>
    <Paragraph position="0"> There are two approaches in the literature that are similar enough to ours that a closer comparison is in order.</Paragraph>
    <Paragraph position="1"> The first is a family of algorithms for discourse ordering based on genetic programming (Mellish et al., 1998; Karamanis and Manurung, 2002). This is a very flexible and powerful approach, which can be applied to measures of local coherence that do not seem to fit in our framework trivially. For example, the measure from (Mellish et al., 1998) looks at the entire discourse up to the current transition for some of their cost factors. However, our algorithm is several orders of magnitude faster where a direct comparison is possible (Manurung, p.c.), and it is guaranteed to find an optimal ordering. The nonapproximability result for TSP means that a genetic (or any other) algorithm which is restricted to polynomial runtime could theoretically deliver arbitrarily bad solutions.</Paragraph>
    <Paragraph position="2"> Second, the discourse ordering problem we have discussed in this paper looks very similar to the Majority Ordering problem that arises in the context of multi-document summarisation (Barzilay et al., Both cabinets probably entered England in the early nineteenth century / after the French Revolution caused the dispersal of so many French collections. / The pair to [this monumental cabinet] still exists in Scotland. / The fleurs-de-lis on the top two drawers indicate that [the cabinet] was made for the French King Louis XIV. / [It] may have served as a royal gift, / as [it] does not appear in inventories of [his] possessions. / Another medallion inside shows [him] a few years later. / The bronze medallion above [the central door] was cast from a medal struck in 1661 which shows [the king] at the age of twenty-one. / A panel of marquetry showing the cockerel of [France] standing triumphant over both the eagle of the Holy Roman Empire and the lion of Spain and the Spanish Netherlands decorates [the central door]. / In [the Dutch Wars] of 1672 - 1678, [France] fought simultaneously against the Dutch, Spanish, and Imperial armies, defeating them all. / [The cabinet] celebrates the Treaty of Nijmegen, which concluded [the war]. / The Sun King's portrait appears twice on [this work]. / Two large figures from Greek mythology, Hercules and Hippolyta, Queen of the Amazons, representatives of strength and bravery in war appear to support [the cabinet]. / The decoration on [the cabinet] refers to [Louis XIV's] military victories. / On the drawer above the door, gilt-bronze military trophies flank a medallion portrait of [the king].</Paragraph>
    <Paragraph position="3">  2002). The difference between the two problems is that Barzilay et al. minimise the sum of all costs Cij for any pair i,j of discourse units with i &lt; j, whereas we only sum over the Cij for i = j [?] 1.</Paragraph>
    <Paragraph position="4"> This makes their problem amenable to the approximation algorithm by Cohen et al. (1999), which allows them to compute a solution that is at least half as good as the optimum, in polynomial time; i.e.</Paragraph>
    <Paragraph position="5"> this problem is strictly easier than TSP or discourse ordering. However, a Majority Ordering algorithm is not guaranteed to compute good solutions to the discourse ordering problem, as Lapata (2003) assumes. null</Paragraph>
  </Section>
class="xml-element"></Paper>