File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1046_metho.xml
Size: 15,657 bytes
Last Modified: 2025-10-06 14:10:10
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1046"> <Title>Aggregation via Set Partitioning for Natural Language Generation</Title> <Section position="4" start_page="360" end_page="360" type="metho"> <SectionTitle> 3 Problem Formulation </SectionTitle> <Paragraph position="0"> We formulate aggregation as a supervised partitioning task, where the goal is to find a clustering of input items that maximizes a global utility function. The input to the model consists of a set E of database entries selected by a content planner.</Paragraph> <Paragraph position="1"> The output of the model is a partition S = {S_i} of nonempty subsets such that each element of E appears in exactly one subset. (By definition, a partitioning of a set defines an equivalence relation, which is reflexive, symmetric, and transitive.) In the context of aggregation, each subset represents entries that should be verbalized in the same sentence. An example of a partitioning is illustrated in the right side of Table 2, where eight entries are partitioned into six clusters.</Paragraph> <Paragraph position="2"> We assume access to a relational database where each entry has a type and a set of attributes associated with it. Table 2 (left) shows an excerpt of the database we used for our experiments.</Paragraph> <Paragraph position="3"> The aggregated text in Table 2 (right) contains entries of five types: Passing, Interception, Kicking, Rushing, and Fumbles. Entries of type Passing have six attributes (PLAYER, CP/AT, YDS, AVG, TD, INT), entries of type Interception have four attributes, and so on.</Paragraph> <Paragraph position="5"> We assume the existence of a non-empty set of attributes that we can use for meaningful comparison between entities of different types. In the example above, the types Passing and Rushing share the attributes PLAYER, AVG (short for average), TD (short for touchdown), and YDS (short for yards). These are indicated in boldface in Table 2. In Section 4.1, we discuss how a set of shared attributes can be determined for a given database.</Paragraph> <Paragraph position="6"> Our training data consists of entry sets with a known partitioning. During testing, our task is to infer a partitioning for an unseen set of entries.</Paragraph> </Section>
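To make the output representation concrete, the following is a minimal Python sketch of the validity condition on a partitioning (the entry ids and clusters are hypothetical, standing in for the eight entries and six clusters of Table 2):

```python
def is_valid_partition(entries, clusters):
    """A partition: nonempty, disjoint subsets that jointly cover all entries."""
    seen = []
    for cluster in clusters:
        if not cluster:
            return False  # subsets must be nonempty
        seen.extend(cluster)
    # each entry must appear in exactly one subset
    return sorted(seen) == sorted(entries)

# Hypothetical entry ids grouped into six clusters, as in the Table 2 example.
entries = list(range(8))
clusters = [[0, 1], [2, 3], [4], [5], [6], [7]]
assert is_valid_partition(entries, clusters)
```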
Moreover, a pairwise classification model cannot express general constraints regarding the partitioning as a whole. For example, we may want to constrain the size of the generated partitions, the compression rate of the document, or the complexity of the generated sentences. To address these requirements, our approach relies on global inference. Given the pairwise predictions of a local classifier, our model finds a globally optimal assignment that satisfies partition-level constraints. The computational challenge lies in the complexity of such a model: we need to find an optimal partition in an exponentially large search space. Our approach is based on an Integer Linear Programming (ILP) formulation which can be effectively solved using standard optimization tools. ILP models have been successfully applied in several natural language processing tasks, including relation extraction (Roth and Yih, 2004), semantic role labeling (Punyakanok et al., 2004), and the generation of route directions (Marciniak and Strube, 2005).</Paragraph> <Paragraph position="2"> In the following sections, we first introduce our local pairwise model and then present our global model for partitioning.</Paragraph> <Section position="1" start_page="361" end_page="361" type="sub_section"> <SectionTitle> 4.1 Learning Pairwise Similarity </SectionTitle> <Paragraph position="0"> Our goal is to determine whether two database entries should be aggregated given the similarity of their shared attributes. We generate the training data by considering all pairs $\langle e_i, e_j \rangle \in E \times E$, where E is the set of all entries attested in a given document.</Paragraph> <Paragraph position="1"> An entry pair forms a positive instance if its members belong to the same partition in the training data.</Paragraph> <Paragraph position="2"> For example, we will generate $8 \times 7 / 2 = 28$ unordered entry pairs for the eight entries from the document in Table 2. From these, only two pairs constitute positive instances, namely the pairs forming clusters 1 and 2. All other pairs form negative instances.</Paragraph> <Paragraph position="3"> The computation of pairwise similarity is based on the attribute set A = {A_i} shared between the two entries in the pair. As discussed in Section 3, the same attributes can characterize multiple entry types, and thus form a valid basis for entry comparison. The shared attribute set A could be identified in many ways, for example using domain knowledge or by selecting attributes that appear across multiple types. In our experiments, we follow the second approach: we order attributes by the number of entry types in which they appear, and select the top five.</Paragraph> <Paragraph position="4"> A pair of entries is represented by a binary feature vector {x_i} in which coordinate x_i indicates whether the two entries have the same value for attribute i. The feature vector is further expanded with conjunctive features that explicitly represent overlap in the values of multiple attributes up to size k. The parameter k controls the cardinality of the maximal conjunctive set and is optimized on the development set.</Paragraph> <Paragraph position="5"> To illustrate our feature generation process, consider the pair (Passing (Quincy Carter 15/32 116 3.6 1 0)) and (Rushing (Troy Hambrick 13 33 2.5 10 1)) from Table 2. Assuming A = {PLAYER, YDS, TD} and k = 2, the similarity between the two entries will be expressed by six features, three representing overlap in individual attributes and three representing overlap when considering pairs of attributes. The resulting feature vector has the form $\langle 0, 0, 1, 0, 0, 0 \rangle$.</Paragraph>
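The feature construction just described can be sketched as follows (a minimal illustration assuming the shared attributes and k from the example above; the dictionary representation of entries is hypothetical, not the paper's database format):

```python
from itertools import combinations

def pair_features(entry1, entry2, shared_attrs, k=2):
    """Binary overlap features, one per shared attribute, plus conjunctive
    features over attribute subsets of size 2..k."""
    base = [int(entry1[a] == entry2[a]) for a in shared_attrs]
    features = list(base)
    for size in range(2, k + 1):
        for combo in combinations(range(len(shared_attrs)), size):
            # conjunction fires only if every attribute in the subset overlaps
            features.append(int(all(base[i] for i in combo)))
    return features

# The Table 2 example pair, with shared attributes PLAYER, YDS, TD and k = 2:
passing = {"PLAYER": "Quincy Carter", "YDS": 116, "TD": 1}
rushing = {"PLAYER": "Troy Hambrick", "YDS": 33, "TD": 1}
print(pair_features(passing, rushing, ["PLAYER", "YDS", "TD"]))
# -> [0, 0, 1, 0, 0, 0], matching the feature vector in the text
```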
<Paragraph position="6"> Once we define a mapping from database entries to features, we employ a machine learning algorithm to induce a classifier on the feature vectors generated from the training documents. In our experiments, we used a publicly available maximum entropy classifier for this task.</Paragraph> </Section> <Section position="2" start_page="361" end_page="363" type="sub_section"> <SectionTitle> 4.2 Partitioning with ILP </SectionTitle> <Paragraph position="0"> Given the pairwise predictions of the local classifier, we wish to find a valid global partitioning for the entries in a single document. We thus model the interaction between all pairwise aggregation decisions as an optimization problem.</Paragraph> <Paragraph position="1"> Let $c_{\langle e_i,e_j \rangle}$ be the probability of seeing entry pair $\langle e_i, e_j \rangle$ aggregated (as computed by the pairwise classifier). Our goal is to find an assignment that maximizes the sum of pairwise scores and forms a valid partitioning. We represent an assignment using a set of indicator variables $x_{\langle e_i,e_j \rangle}$ that are set to 1 if $\langle e_i, e_j \rangle$ is aggregated, and 0 otherwise. The score of a global assignment is the sum of its pairwise scores, which we maximize:

\[ \max \sum_{\langle e_i,e_j \rangle \in E \times E} c_{\langle e_i,e_j \rangle}\, x_{\langle e_i,e_j \rangle} + \left(1 - c_{\langle e_i,e_j \rangle}\right)\left(1 - x_{\langle e_i,e_j \rangle}\right) \quad (2) \]

subject to:

\[ x_{\langle e_i,e_j \rangle} \in \{0, 1\} \quad \forall\, e_i, e_j \in E \times E \quad (3) \]

We augment this basic formulation with two types of constraints. The first type of constraint ensures that pairwise assignments lead to a consistent partitioning, while the second type expresses global constraints on the partitioning.</Paragraph> <Paragraph position="2"> Transitivity Constraints We place constraints that enforce transitivity in the label assignment: if $x_{\langle e_i,e_j \rangle} = 1$ and $x_{\langle e_j,e_k \rangle} = 1$, then $x_{\langle e_i,e_k \rangle} = 1$.</Paragraph> <Paragraph position="3"> A pairwise assignment that satisfies this constraint defines an equivalence relation, and thus yields a unique partitioning of the input entries (Cormen et al., 1992).</Paragraph> <Paragraph position="4"> We implement transitivity constraints by introducing, for every triple $e_i, e_j, e_k$ ($i \neq j \neq k$), an inequality of the following form:

\[ x_{\langle e_i,e_k \rangle} \geq x_{\langle e_i,e_j \rangle} + x_{\langle e_j,e_k \rangle} - 1 \quad (4) \]

If both $x_{\langle e_i,e_j \rangle}$ and $x_{\langle e_j,e_k \rangle}$ are set to one, then $x_{\langle e_i,e_k \rangle}$ also has to be one. Otherwise, $x_{\langle e_i,e_k \rangle}$ can be either 1 or 0.</Paragraph> <Paragraph position="5"> Global Constraints We also want to consider global document properties that influence aggregation. For example, documents with many database entries are likely to exhibit different compression rates during aggregation than documents that contain only a few.</Paragraph> <Paragraph position="6"> Our first global constraint controls the number of aggregated sentences in the document. This is achieved by limiting the number of entry pairs with positive labels in each document:

\[ \sum_{\langle e_i,e_j \rangle \in E \times E} x_{\langle e_i,e_j \rangle} \leq m \quad (5) \]
</Paragraph> <Paragraph position="8"> Notice that the number $m$ is not known in advance. However, we can estimate this parameter from our development data by considering documents of similar size (as measured by the number of corresponding entry pairs). For example, texts with a thousand entry pairs contain on average 70 positive labels, while documents with 200 pairs have around 20 positive labels. We therefore set $m$ separately for every document by taking the average number of positive labels observed in the development data for the document size in question.</Paragraph> <Paragraph position="9"> The second set of constraints controls the length of the generated sentences. We expect that there is an upper limit on the number of entries that can be clustered together. This restriction can be expressed in the following form:

\[ \sum_{e_j \in E} x_{\langle e_i,e_j \rangle} \leq k \quad \forall\, e_i \in E \quad (6) \]

This constraint ensures that there can be at most $k$ positively labeled pairs for any entry $e_i$. In our corpus, for instance, at most nine entries can be aggregated in a sentence. Again, $k$ is estimated from the development data by taking into account the average number of positively labeled pairs for every entry type (see Table 2). We thereby indirectly capture the fact that some entry types (e.g., Passing) are more likely to be aggregated than others (e.g., Kicking).</Paragraph>
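As an illustration of formulation (2)-(6), here is a minimal sketch using the PuLP modeling library and its bundled CBC solver (the paper itself used lp_solve); the pairwise scores c and the bounds m and k are assumed inputs:

```python
import itertools
import pulp

def aggregate_ilp(n_entries, c, m, k):
    """Maximize objective (2) subject to consistency (3)-(4) and the
    global constraints (5)-(6). c maps each unordered pair (i, j), i < j,
    to the classifier's aggregation probability."""
    pairs = list(itertools.combinations(range(n_entries), 2))
    x = pulp.LpVariable.dicts("x", pairs, cat="Binary")  # constraint (3)

    prob = pulp.LpProblem("aggregation", pulp.LpMaximize)
    # Objective (2): reward agreeing with the classifier on both labels.
    prob += pulp.lpSum(c[p] * x[p] + (1 - c[p]) * (1 - x[p]) for p in pairs)

    # Transitivity (4), stated for all three rotations of each triple.
    for i, j, l in itertools.combinations(range(n_entries), 3):
        prob += x[(i, l)] >= x[(i, j)] + x[(j, l)] - 1
        prob += x[(i, j)] >= x[(i, l)] + x[(j, l)] - 1
        prob += x[(j, l)] >= x[(i, j)] + x[(i, l)] - 1

    # Document-level bound (5) on the number of positive pairs.
    prob += pulp.lpSum(x[p] for p in pairs) <= m

    # Per-entry bound (6), limiting how many pairs any entry joins.
    for i in range(n_entries):
        prob += pulp.lpSum(x[p] for p in pairs if i in p) <= k

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {p: int(pulp.value(x[p])) for p in pairs}
```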
<Paragraph position="12"> Solving the ILP In general, solving an integer linear program is NP-hard (Cormen et al., 1992). Fortunately, there exist several strategies for solving ILPs. In our study, we employed lp_solve, an efficient Mixed Integer Programming solver that implements the branch-and-bound algorithm. We generate and solve an ILP for every document we wish to aggregate. Documents of average size (approximately 350 entry pairs) take under 30 minutes on a 450 MHz Pentium III machine.</Paragraph> </Section> </Section> <Section position="6" start_page="363" end_page="364" type="metho"> <SectionTitle> 5 Evaluation Set-up </SectionTitle> <Paragraph position="0"> The model presented in the previous section was evaluated in the context of generating summary reports for American football games. In this section we describe the corpus used in our experiments, our procedure for estimating the parameters of our models, and the baseline method used for comparison with our approach.</Paragraph> <Paragraph position="1"> Data For training and testing our algorithm, we employed a corpus of football game summaries collected by Barzilay and Lapata (2005). The corpus contains 468 game summaries from the official site of the National Football League (NFL).</Paragraph> <Paragraph position="2"> Each summary has an associated database containing statistics about individual players and events. In total, the corpus contains 73,400 database entries, 7.1% of which are verbalized; each entry is characterized by a type and a set of attributes (see Table 2). Database entries are automatically aligned with their corresponding sentences in the game summaries by a procedure that considers anchor overlap between entity attributes and sentence tokens. Although the alignment procedure is relatively accurate, there is unavoidably some noise in the data.</Paragraph> <Paragraph position="3"> The distribution of database entries per sentence is shown in Figure 1. As can be seen, most aggregated sentences correspond to two or three database entries. Each game summary contained 14.3 entries and 9.1 sentences on average. The training and test data were generated as described in Section 4.1. We used 96,434 instances (300 summaries) for training, 59,082 instances (68 summaries) for testing, and 53,776 instances (100 summaries) for development purposes.</Paragraph>
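To give a flavor of anchor-overlap alignment, here is a simplified sketch; the matching rule, threshold, and entry representation are illustrative assumptions, not the paper's actual procedure:

```python
def align(entries, sentences, min_overlap=2):
    """Attach each database entry to the sentence sharing the most anchors
    (attribute-value tokens appearing verbatim), if overlap is sufficient."""
    alignments = {}
    for idx, entry in enumerate(entries):
        # turn every attribute value into lowercase anchor tokens
        anchors = {tok for v in entry["attributes"].values()
                   for tok in str(v).lower().split()}
        best, best_score = None, 0
        for s_idx, sentence in enumerate(sentences):
            score = len(anchors & set(sentence.lower().split()))
            if score > best_score:
                best, best_score = s_idx, score
        if best is not None and best_score >= min_overlap:
            alignments[idx] = best
    return alignments
```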
<Paragraph position="4"> Parameter Estimation As explained in Section 4, we infer a partitioning over a set of database entries in a two-stage process. We first determine how likely all entry pairs are to be aggregated using a local classifier, and then infer a valid global partitioning for all entries. The set of shared attributes A consists of five features that capture overlap in players, time (measured by game quarters), action type, and other attributes shared across entry types. Overall, our local classifier used 28 features, including 23 conjunctive ones. The maximum entropy classifier was trained for 100 iterations. The global constraints of our ILP model (see equations (5) and (6)) are parametrized by $m$ and $k$, which are estimated separately for every test document. The values of $m$ ranged from 2 to 130, and those of $k$ from 2 to 9.</Paragraph> <Paragraph position="5"> Baseline Clustering is a natural baseline model for our partitioning problem. In our experiments, we employ a single-link agglomerative clustering algorithm that uses the scores returned by the maximum entropy classifier as a pairwise distance measure. Initially, the algorithm creates a separate cluster for each entry. During each iteration, the two closest clusters are merged. Again, we do not know in advance the appropriate number of clusters for a given document. This number is estimated from the training data by averaging the number of sentences in documents of the same size.</Paragraph> <Paragraph position="6"> Evaluation Measures We evaluate the performance of the ILP and clustering models by measuring F-score over pairwise label assignments. We compute F-score individually for each document and report the average. In addition, we compute partition accuracy in order to determine how many sentence-level aggregations our model predicts correctly. (A results table followed here; its caption notes that precision, recall, and F-score are averaged over documents, and that it compares the clustering and ILP models.)</Paragraph> </Section> </Paper>
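As a minimal sketch of the evaluation measure (hypothetical helpers, not the paper's evaluation code), pairwise F-score over label assignments can be computed as follows:

```python
from itertools import combinations

def pairwise_labels(partition):
    """Map a partition (list of clusters of entry ids) to a boolean label
    for every unordered entry pair: True iff the pair shares a cluster."""
    cluster_of = {e: i for i, cluster in enumerate(partition) for e in cluster}
    entries = sorted(cluster_of)
    return {p: cluster_of[p[0]] == cluster_of[p[1]]
            for p in combinations(entries, 2)}

def pairwise_f_score(gold, predicted):
    """F-score of predicted positive pairs against gold positive pairs."""
    g, p = pairwise_labels(gold), pairwise_labels(predicted)
    tp = sum(1 for pair, label in p.items() if label and g[pair])
    if tp == 0:
        return 0.0
    precision = tp / sum(p.values())
    recall = tp / sum(g.values())
    return 2 * precision * recall / (precision + recall)

# Example: gold groups {0,1},{2,3},{4}; the prediction over-merges 2, 3, 4.
print(pairwise_f_score([[0, 1], [2, 3], [4]], [[0, 1], [2, 3, 4]]))  # ~0.667
```

The document-level average reported in the paper would then be the mean of this score over all test documents.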