<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1070">
  <Title>Instance-based Sentence Boundary Determination by Optimization for Natural Language Generation</Title>
  <Section position="4" start_page="565" end_page="566" type="metho">
    <SectionTitle>
3 Examples
</SectionTitle>
    <Paragraph position="0"> Before we describe our approach in detail, we start with a few examples from the real-estate domain to demonstrate the properties of the proposed approach. null First, sentence complexity impacts sentence boundary determination. As shown in Table 1, after receiving a user's request (U1) for the details of a house, the content planner asked the sentence planner to describe the house with a set of attributes including its asking price, style, number of bedrooms, number of bathrooms, square footage, garage, lot size, property tax, and its associated town and school</Paragraph>
    <Section position="1" start_page="566" end_page="566" type="sub_section">
      <SectionTitle>
Table 1 (columns: Example, Turn, Sentence)
</SectionTitle>
      <Paragraph position="0"> E1 U1 Tell me more about this house S1 This is a 1 million dollar 3 bedroom, 2 bathroom, 2000 square foot colonial with 2 acre of land, 2 car garage, annual taxes 8000 dollars in Armonk and in the Byram Hills school district.</Paragraph>
      <Paragraph position="1"> S2 This is a 1 million dollar house. This is a 3 bedroom house. This is a 2 bathroom house. This house has 2000 square feet. This house has 2 acres of land. This house has 2 car garage. This is a colonial house. The annual taxes are 8000 dollars. This house is in Armonk. This house is in the Byram Hills school district.</Paragraph>
      <Paragraph position="2"> S3 This is a 3 bedroom, 2 bathroom, 2000 square foot colonial located in Armonk with 2 acres of land. The asking price is 1 million dollar and the annual taxes are 8000 dollars. The house is located in the Byram Hills School District.</Paragraph>
      <Paragraph position="3"> E2 S4 This is a 1 million dollar 3 bedroom house. This is a 2 bathroom house with annual taxes of 8000 dollars.</Paragraph>
      <Paragraph position="4">  district name. Without proper sentence boundary determination, a sentence planner may formulate a single sentence to convey all the information, as in S1. Even though S1 is grammatically correct, it is too complex and too exhausting to read. Similarly, output like S2, despite its grammatical correctness, is choppy and too tedious to read. In contrast, our instance-based sentence boundary determination module will use examples in a corpus to partition those attributes into several sentences in a more balanced manner (S3).</Paragraph>
      <Paragraph position="5"> Semantic cohesion also influences the quality of output sentences. For example, in the real-estate domain, the number of bedrooms and number of bathrooms are two closely related concepts. Based on our corpus, when both concepts appear, they almost always conveyed together in the same sentence. Given this, if the content planner wants to convey a house with the following attributes: price, number of bedrooms, number of bathrooms, and property tax, S4 is a less desirable solution than S5 because it splits these concepts into two separate sentences. Since we use instance-based sentence boundary determination, our method generates S5 to minimize the difference from the corpus instances.</Paragraph>
      <Paragraph position="6"> Sentence boundary placement is also sensitive to the syntactic and lexical realizability of grouped items. For example, if the sentence planner asks the surface realizer to convey two propositions S6 and S7 together in a sentence, a realization failure will be triggered because both S6 and S7 only exist in the corpus as independent sentences. Since neither of them can be transformed into a modifier based on the corpus, S6 and S7 cannot be aggregated in our system. Our method takes a sentence's lexical and syntactic realizability into consideration in order to avoid making such aggregation request to the surface realizer in the first place.</Paragraph>
      <Paragraph position="7"> A generation system's own capability may also influence sentence boundary determination. Good sentence boundary decisions will balance a system's strengths and weaknesses. In contrast, bad decisions will expose a system's venerability. For example, if a sentence generator is good at performing aggregations and weak on referring expressions, we may avoid incoherence between sentences by preferring aggregating more attributes in one sentence (like in S8) rather than by splitting them into multiple sentences (like in S9).</Paragraph>
      <Paragraph position="8"> In the following, we will demonstrate how our approach can achieve all the above goals in a unified instance-based framework.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="566" end_page="570" type="metho">
    <SectionTitle>
4 Instance-based boundary determination
</SectionTitle>
    <Paragraph position="0"> Instance-based generation automatically creates sentences that are similar to those generated by humans, including their way of grouping semantic content, their wording and their style. Previously, Pan and Shaw (2004) have demonstrated that instance-based learning can be applied successfully in generating new sentences by piecing together existing words and segments in a corpus. Here, we want to demonstrate that by applying the same principle, we can make better sentence boundary decisions.</Paragraph>
    <Paragraph position="1">  The key idea behind the new approach is to find a sentence boundary solution that minimizes the expected difference between the sentences resulting from these boundary decisions and the examples in the corpus. Here we measure the expected difference based a set of cost functions.</Paragraph>
    <Section position="1" start_page="567" end_page="568" type="sub_section">
      <SectionTitle>
4.1 Optimization Criteria
</SectionTitle>
      <Paragraph position="0"> We use three sentence complexity and quality related cost functions as the optimization criteria: sentence boundary cost, insertion cost and deletion cost.</Paragraph>
      <Paragraph position="1"> Sentence boundary cost (SBC): Assuming P is a set of propositions to be conveyed and S is a collection of example sentences selected from the corpus to convey P. Then we say P can be realized by S with a sentence boundary cost that is equal to (|S|[?]1) [?]SBC in which |S |is the number of sentences and SBC is the sentence boundary cost. To use a specific example from the real-estate domain, the input P has three propositions:  . This is a colonial house.</Paragraph>
      <Paragraph position="2"> Since only one sentence boundary is involved, S is a solution containing one boundary cost. In the above example, even though both s  is not quite smooth. They sound choppy and disjointed. To penalize this, whenever there is a sentence break, there is a SBC. In general, the SBC is a parameter that is sensitive to a generation system's capability such as its competence in reference expression generation. If a generation system does not have a robust approach for tracking the focus across sentences, it is likely to be weak in referring expression generation and adding sentence boundaries are likely to cause fluency problems. In contrast, if a generation system is very capable in maintaining the coherence between sentences, the proper sentence boundary cost would be lower.</Paragraph>
      <Paragraph position="3"> Insertion cost: Assume P is the set of propositions to be conveyed, and C i is an instance in the corpus that can be used to realize P by inserting a missing proposition p</Paragraph>
      <Paragraph position="5"> , then we say P can be realized using C</Paragraph>
      <Paragraph position="7"> is the host sentence in the corpus containing proposition p j . Using an example from our real-estate domain, assume the input  ), in which C H is a sentence in the corpus such as &amp;quot;This is a house with 2000 square feet.&amp;quot; The insertion cost is influenced by two main factors: the syntactic and lexical insertability of the proposition p j and a system's capability in aggregating propositions. For example, if in the corpus, the proposition p j is always realized as an independent sentence and never as a modifier, icost([?],p j ) should be extremely high, which effectively prohibit p j from becoming a part of another sentence. icost([?],p j ) is defined as the minimum insertion cost among all the icost(C</Paragraph>
      <Paragraph position="9"> ) is computed dynamically based on properties of corpus instances. In addition, since whether a proposition is insertable depends on how capable an aggregation module can combine propositions correctly into a sentence, the insertion cost should be assigned high or low accordingly.</Paragraph>
      <Paragraph position="10"> Deletion cost: Assume P is a set of input propositions to be conveyed and C</Paragraph>
      <Paragraph position="12"> is an instance in the corpus that can be used to convey P by deleting an unneeded proposition p</Paragraph>
      <Paragraph position="14"> As a specific example, assuming the input is P=(p  of the verb, will make the rest of the sentence incomplete. As a result, dcost(C</Paragraph>
      <Paragraph position="16"> ) is very expensive. In contrast, dcost(C</Paragraph>
      <Paragraph position="18"> ) is low because the resulting sentence is still grammatically sound. Cur-</Paragraph>
      <Paragraph position="20"> ) is computed dynamically based on properties of corpus instances. Second, the expected performance of a generation system in deletion also impacts the deletion cost. Depending on the sophistication of the generator to handle various deletion situations, the expected deletion cost can be high if the method employed is naive and error prone, or is low if the system can handle most cases accurately.</Paragraph>
      <Paragraph position="21"> Overall cost: Assume P is the set of propositions to be conveyed and S is the set of instances in the corpus that are chosen to realize P by applying a set of insertion, deletion and sentence breaking operations, the overall cost of the solution</Paragraph>
      <Paragraph position="23"> and SBC are the insertion weight, deletion weight and sentence boundary cost; N</Paragraph>
    </Section>
    <Section position="2" start_page="568" end_page="569" type="sub_section">
      <SectionTitle>
4.2 Algorithm: Optimization based on overall
cost
</SectionTitle>
      <Paragraph position="0"> We model the sentence boundary determination process as a branch and bound tree search problem. Before we explain the algorithm itself, first a few notations. The input P is a set of input propositions chosen by the content planner to be realized. S is the set of all possible propositions in an application domain. Each instance C</Paragraph>
      <Paragraph position="2"> in the corpus C is represented as a subset of S. Assume S is a solution to P, then it can be represented as the overall cost plus a list of pairs like (C</Paragraph>
      <Paragraph position="4"> of the instances selected to be used in that solution,</Paragraph>
      <Paragraph position="6"> To explain this representation further, we use a specific example in which P=(a, d, e, f), S=(a, b, c, d, e, f g, h, i). One of the boundary solution S can be  are two corpus instances selected as the bases to formulate the solution and C  is the host sentence containing proposition f. The general idea behind the instance-based branch and bound tree search algorithm is that given an input, P, for each corpus instance C</Paragraph>
      <Paragraph position="8"> struct a search branch, representing all possible ways to realize the input using the instance plus deletions, insertions and sentence breaks. Since each sentence break triggers a recursive call to our sentence boundary determination algorithm, the complexity of the algorithm is NP-hard. To speed up the process, for each iteration, we prune unproductive branches using an upper bound derived by several greedy algorithms. The details of our sentence boundary determination algorithm, sbd(P), are described below. P is the set of input propositions.</Paragraph>
      <Paragraph position="9">  1. Set the current upper bound, UB, to the minimum cost of solutions derived by greedy algorithms, which we will describe later. This value is used to prune unneeded branches to make the search more efficient.</Paragraph>
      <Paragraph position="10"> 2. For each instance C</Paragraph>
      <Paragraph position="12"> is to identify all the useful corpus instances for realizing P.</Paragraph>
      <Paragraph position="13"> 3. Delete all the propositions p</Paragraph>
      <Paragraph position="15"> but not exist in P) with cost Cost  9. These steps figure out all the possible ways to add the missing propositions, including inserting into the instance C i and separating the rest as independent sentence(s).  in which Cost(Q) is the cost of sbd(Q) which recursively computes the best solution for input Q and Q [?] P. To facilitate dynamic programming, we remember the best solution for Q derived by sbd(Q) in case Q is used to formulate other solutions.</Paragraph>
      <Paragraph position="16"> 7. If the lower bound for Cost(P) is greater than the established upper bound UB, prune this branch.</Paragraph>
      <Paragraph position="17"> 8. Using the notation described in the beginning of Sec. 4.2, we update the current solution to  is an operator that composes two partial solutions.</Paragraph>
      <Paragraph position="18"> 9. If sbd(P) is a complete solution (either Q is empty or have a known best solution) and Cost(P) &lt;UB, update the upper bound UB = Cost(P).</Paragraph>
      <Paragraph position="19"> 10. Output the solution with the lowest overall cost.  To establish the initial UB for pruning, we use the minimum of the following three bounds. In general, the tighter the UB is, the more effective the pruning is.</Paragraph>
      <Paragraph position="20"> Greedy set partition: we employ a greedy set partition algorithm in which we first match the set S [?] P with the largest |S|. Repeat the same process for P prime where P prime = P [?] S. The solution cost is Cost(P)=(N [?] 1) [?] SBC, and N is the number of sentences in the solution. The complexity of this computation is O(|P|), where |P |is the number of propositions in P.</Paragraph>
      <Paragraph position="21"> Revised minimum set covering: we employ a greedy minimum set covering algorithm in which we first find the set S in the corpus that maximizes the overlapping of propositions in the input P. The unwanted propositions in S [?] P are deleted. As- null and the previous approach is that S here might not be a subset of P. The complexity of this computation is O(|P|).</Paragraph>
      <Paragraph position="22"> One maximum overlapping sentence: we first identify the instance C i in corpus that covers the maximum number of propositions in P. To arrive at a solution for P, the rest of the propositions not</Paragraph>
      <Paragraph position="24"> and all the unwanted propositions in C i are deleted. The cost of this solution is  in which D includes proposition in C i but not in P, and I includes propositions in P but not in C</Paragraph>
      <Paragraph position="26"> Currently, we update UB only after a complete solution is found. It is possible to derive better UB by establishing the upper bound for each partial solution, but the computational overhead might not justify doing so.</Paragraph>
    </Section>
    <Section position="3" start_page="569" end_page="570" type="sub_section">
      <SectionTitle>
4.3 Approximation Algorithm
</SectionTitle>
      <Paragraph position="0"> Even with pruning and dynamic programming, the exact solution still is very expensive computationally. Computing exact solution for an input size of 12 propositions has over 1.6 millions states and takes more than 30 minutes (see Figure 1). To make the search more efficient for tasks with a large number of propositions in the input, we naturally seek a greedy strategy in which at every iteration the algorithm myopically chooses the next best step without regard for its implications on future moves. One greedy search policy we implemented explores the branch that uses the instance with maximum overlapping propositions with the input and ignores all branches exploring other corpus instances. The intuition behind this policy is that the more overlap an instance has with the input, the less insertions or sentence breaks are needed.</Paragraph>
      <Paragraph position="1"> Figure 1 and Figure 2 demonstrate the trade-off between computation efficiency and accuracy.</Paragraph>
      <Paragraph position="2"> In this graph, we use instances from the real-estate corpus with size 250, we vary the input sentence length from one to twenty and the results shown in the graphs are average value over several typical weight configurations ((W  (1,3,5),(1,3,7),(1,5,3),(1,7,3),(1,1,1)). Figure 2 compares the quality of the solutions when using exact solutions versus approximation. In our interactive multimedia system, we currently use exact solution for input size of 7 propositions or less and switch to greedy for any larger input size to ensure sub-second performance for the NLG component.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>