<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0618"> <Title>Beyond the Pipeline: Discrete Optimization in NLP</Title>
<Section position="5" start_page="0" end_page="136" type="metho"> <SectionTitle> 2 Solving NLP Tasks with Classifiers </SectionTitle>
<Paragraph position="0"> Classification can be defined as the task Ti of assigning one of a discrete set of mi possible labels Li = {li1,...,limi} to an unknown instance. Since generic machine-learning algorithms can be applied to solving single-valued predictions only, complex structures must be built from the outputs of many different classifiers.</Paragraph>
<Paragraph position="1"> In an application implemented as a cascade of classifiers the output representation is built incrementally, with subsequent classifiers having access to the outputs of previous modules. An important characteristic of this model is its extensibility: it is generally easy to change the ordering or insert new modules at any place in the pipeline (both operations only require retraining classifiers with a new selection of the input features). A major problem with sequential processing of linguistic data stems from the fact that elements of linguistic structure, at the semantic or syntactic levels, are strongly correlated with one another. Hence classifiers that have access to additional contextual information perform better than if this information is withheld. In most cases, though, if task Tk can use the output of Ti to increase its accuracy, the reverse is also true. In practice this type of processing may lead to error propagation: if, due to the scarcity of contextual information, the accuracy of initial classifiers is low, erroneous values passed as input to subsequent tasks can cause further misclassifications which distort the final outcome (also discussed by Roth & Yih (2004) and van den Bosch et al. (1998)).</Paragraph>
<Paragraph position="2"> As can be seen in Figure 1, solving classification tasks sequentially corresponds to the best-first traversal of a weighted multi-layered lattice. Nodes at separate layers (T1,...,Tn) represent labels of the different classification tasks, and transitions between the nodes are augmented with the probabilities of selecting respective labels at the next layer. In the sequential model only transitions between nodes belonging to subsequent layers are allowed. At each step, the transition with the highest local probability is selected, and the selected nodes correspond to the outcomes of the individual classifiers. This graphical representation shows that sequential processing does not guarantee an optimal context-dependent assignment of class labels, and favors tasks that occur later, by providing them with contextual information, over those that are solved first.</Paragraph>
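<Paragraph position="3"> To make the failure mode concrete, the following is a minimal sketch of such best-first decoding; the task names and conditional distributions are invented toys, not data from the paper:
```python
# Greedy (best-first) pipeline decoding over a multi-layered lattice.
# Each classifier maps the labels chosen so far (its context) to a
# probability distribution over its own labels; the pipeline commits
# to the locally most probable label at every step.

def pipeline_decode(tasks, classifiers):
    """Select, for each task in order, the locally best label."""
    context = {}
    for task in tasks:
        dist = classifiers[task](context)        # P(label | earlier choices)
        context[task] = max(dist, key=dist.get)  # local, not global, optimum
    return context

# Toy instance: T2's distribution depends on the label chosen for T1.
classifiers = {
    "T1": lambda ctx: {"a": 0.45, "b": 0.55},
    "T2": lambda ctx: ({"x": 0.95, "y": 0.05} if ctx["T1"] == "a"
                       else {"x": 0.50, "y": 0.50}),
}

print(pipeline_decode(["T1", "T2"], classifiers))  # {'T1': 'b', 'T2': 'x'}
```
With these numbers the greedy traversal commits to T1 = b (probability 0.55) and reaches a joint probability of at most 0.275, although the globally best path (a, x) has probability 0.4275: an early local decision blocks the optimal assignment.</Paragraph>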
</Section> <Section position="6" start_page="136" end_page="138" type="metho"> <SectionTitle> 3 Discrete Optimization Model </SectionTitle>
<Paragraph position="0"> As an alternative to sequential ordering of NLP tasks we consider the metric labeling problem formulated by Kleinberg & Tardos (2000), and originally applied in an image restoration application, where classifiers determine the true intensity values of individual pixels. The task is formalized as a labeling function f : P → L that maps a set P of n objects onto a set L of m possible labels. The goal is to find an assignment that minimizes the overall cost function Q(f), which has two components: assignment costs, i.e. the costs of selecting a particular label for individual objects, and separation costs, i.e. the costs of selecting a pair of labels for two related objects. Chekuri et al. (2001) proposed an integer linear programming (ILP) formulation of the metric labeling problem, with both assignment costs and separation costs modeled as binary variables of the linear cost function.</Paragraph>
<Paragraph position="1"> Recently, Roth & Yih (2004) applied an ILP model to the simultaneous assignment of semantic roles to the entities mentioned in a sentence and recognition of the relations holding between them. The assignment costs were calculated on the basis of predictions of basic classifiers, i.e. classifiers trained for both tasks individually, with no access to the outcomes of the other task. The separation costs were formulated in terms of binary constraints that specified whether a specific semantic role could occur in a given relation or not.</Paragraph>
<Paragraph position="2"> In the remainder of this paper we present a more general model that is arguably better suited to handling different NLP problems. More specifically, we put no limits on the number of tasks being solved, and express the separation costs as stochastic constraints, which for almost any NLP task can be calculated off-line from the available linguistic data.</Paragraph>
<Section position="1" start_page="137" end_page="138" type="sub_section"> <SectionTitle> 3.1 ILP Formulation </SectionTitle>
<Paragraph position="0"> We consider a general context in which a specific NLP problem consists of individual linguistic decisions modeled as a set of n classification tasks T = {T1,...,Tn}, some pairs of which are related. Each task Ti consists in assigning a label from Li = {li1,...,limi} to an instance that represents the particular decision. Assignments are modeled as variables of a linear cost function. We differentiate between simple variables, which model individual assignments of labels, and compound variables, which represent the respective assignments for each pair of related tasks.</Paragraph>
<Paragraph position="1"> To represent individual assignments the following procedure is applied: for each task Ti, every label from Li is associated with a binary variable x(lij). Each such variable represents a binary choice, i.e. the respective label lij is selected if x(lij) = 1 and rejected otherwise. The coefficient of variable x(lij), which models the assignment cost c(lij), is given by c(lij) = −log2(p(lij)), where p(lij) is the probability of lij being selected as the outcome of task Ti. The probability distribution for each task is provided by the basic classifiers, which do not consider the outcomes of the other tasks.</Paragraph>
<Paragraph position="2"> The role of compound variables is to provide pairwise constraints on the outcomes of individual tasks. Since we are interested in constraining only those tasks that are truly dependent on one another, we first apply the contingency coefficient C to measure the degree of correlation for each pair of tasks. C is a correlation measure for nominal variables, and hence adequate for the type of tasks that we consider here. The coefficient takes values from 0 (no correlation) to 1 (complete correlation) and is calculated by the formula C = sqrt(χ2 / (χ2 + N)), where χ2 is the chi-square statistic computed for the contingency table of the two tasks' outcomes and N is the total number of instances. The significance of C is then determined from the value of χ2 for the given data; see e.g. Goodman & Kruskal (1972).</Paragraph>
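<Paragraph position="3"> As a sketch, C and its significance can be computed directly from a table of co-occurrence counts, e.g. with scipy; the counts below are invented:
```python
# Contingency coefficient C = sqrt(chi2 / (chi2 + N)) for a pair of
# tasks, computed from the co-occurrence counts of their labels.
import numpy as np
from scipy.stats import chi2_contingency

def contingency_coefficient(table):
    """table[j][p] = number of instances where task Ti has label j
    and task Tk has label p. Returns (C, p_value)."""
    table = np.asarray(table)
    chi2, p_value, _, _ = chi2_contingency(table)
    n = table.sum()
    return np.sqrt(chi2 / (chi2 + n)), p_value

# Toy counts for two tasks with 2 and 3 labels respectively.
c, p = contingency_coefficient([[30, 5, 5],
                                [4, 28, 8]])
print(f"C = {c:.3f}, significant at 0.01: {p < 0.01}")
```
</Paragraph>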
<Paragraph position="4"> In the case of tasks Ti and Tk which are significantly correlated, for each pair of labels from Li × Lk we build a single compound variable x(lij,lkp). Each such variable is associated with a coefficient representing the constraint on the respective pair of labels lij, lkp, calculated as c(lij,lkp) = −log2(p(lij,lkp)), with p(lij,lkp) denoting the prior joint probability of labels lij and lkp in the data, which is independent of the general classification context and hence can be calculated off-line.</Paragraph>
<Paragraph position="5"> The ILP model consists of the target function and a set of constraints which block illegal assignments (e.g. only one label of a given task can be selected). In our case the target function is the cost function Q(f), which we want to minimize: min Q(f) = Σi Σj c(lij) x(lij) + Σi<k Σj Σp c(lij,lkp) x(lij,lkp).</Paragraph>
<Paragraph position="6"> Constraints need to be formulated for both the simple and the compound variables. First we want to ensure that exactly one label lij belonging to task Ti is selected, i.e. only one simple variable x(lij) representing labels of a given task can be set to 1: Σj x(lij) = 1, ∀i ∈ {1,...,n}. We also require that if two simple variables x(lij) and x(lkp), modeling respectively labels lij and lkp, are set to 1, then the compound variable x(lij,lkp), which models the co-occurrence of these labels, is also set to 1. This is done in two steps: we first ensure that if x(lij) = 1, then exactly one variable x(lij,lkp) is also set to 1: Σp x(lij,lkp) = x(lij), ∀i,k ∈ {1,...,n}, i < k, ∀j ∈ {1,...,mi}, and do the same for variable x(lkp): Σj x(lij,lkp) = x(lkp), ∀i,k ∈ {1,...,n}, i < k, ∀p ∈ {1,...,mk}. Finally, we constrain the values of both simple and compound variables to be binary: x(lij) ∈ {0,1} ∧ x(lij,lkp) ∈ {0,1}, ∀i,k ∈ {1,...,n}, ∀j ∈ {1,...,mi}, ∀p ∈ {1,...,mk}.</Paragraph>
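<Paragraph position="7"> The model translates almost line by line into an off-the-shelf ILP solver. Below is a minimal sketch using the PuLP library; the probability distributions and joint priors are invented placeholders standing in for the basic classifiers' outputs and the off-line corpus estimates:
```python
# ILP formulation of the model, sketched with PuLP.
# probs[i][j]           = p(l_ij), from task Ti's basic classifier.
# joint[(i, k)][(j, p)] = prior joint probability p(l_ij, l_kp);
# only significantly correlated pairs (i, k) get an entry.
from math import log2
import pulp

def solve(probs, joint):
    model = pulp.LpProblem("metric_labeling", pulp.LpMinimize)

    # Simple variables x(l_ij) and compound variables x(l_ij, l_kp).
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i, dist in probs.items() for j in range(len(dist))}
    y = {(i, k, j, p): pulp.LpVariable(f"y_{i}_{k}_{j}_{p}", cat="Binary")
         for (i, k), pairs in joint.items() for (j, p) in pairs}

    # Target function: assignment costs plus separation costs.
    model += (pulp.lpSum(-log2(probs[i][j]) * x[i, j] for (i, j) in x)
              + pulp.lpSum(-log2(joint[i, k][j, p]) * y[i, k, j, p]
                           for (i, k, j, p) in y))

    # Exactly one label per task.
    for i, dist in probs.items():
        model += pulp.lpSum(x[i, j] for j in range(len(dist))) == 1

    # Compound variables must agree with the simple ones.
    for (i, k), pairs in joint.items():
        for j in range(len(probs[i])):
            model += pulp.lpSum(y[i, k, j, p] for (jj, p) in pairs
                                if jj == j) == x[i, j]
        for p in range(len(probs[k])):
            model += pulp.lpSum(y[i, k, j, p] for (j, pp) in pairs
                                if pp == p) == x[k, p]

    model.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: next(j for j in range(len(probs[i]))
                    if x[i, j].value() > 0.5) for i in probs}

# Toy instance: two correlated tasks with two labels each.
probs = {0: [0.55, 0.45], 1: [0.6, 0.4]}
joint = {(0, 1): {(0, 0): 0.10, (0, 1): 0.45,
                  (1, 0): 0.40, (1, 1): 0.05}}
print(solve(probs, joint))  # -> {0: 1, 1: 0}
```
On this toy instance the separation costs overrule the locally best labels (label 0 for both tasks), which is exactly the kind of global trade-off the pipeline cannot make.</Paragraph>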
</Section> <Section position="2" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 3.2 Graphical Representation </SectionTitle>
<Paragraph position="0"> We can represent the decision process that our ILP model involves as a graph, with the nodes corresponding to individual labels and the edges marking the associations between labels belonging to correlated tasks. In Figure 2, task T1 is correlated with task T2, and task T2 with task Tn; no correlation exists for the pair T1, Tn. Both nodes and edges are augmented with costs. The goal is to select a subset of connected nodes, minimizing the overall cost, given that for each group of nodes T1,T2,...,Tn exactly one node must be selected, and the selected nodes representing correlated tasks must be connected. In contrast to the pipeline approach (cf. Figure 1), no local decisions determine the overall assignment, as the global distribution of costs is considered.</Paragraph>
</Section> </Section>
<Section position="7" start_page="138" end_page="139" type="metho"> <SectionTitle> 4 Application for NL Generation Tasks </SectionTitle>
<Paragraph position="0"> We applied the ILP model described in the previous section to integrate different tasks in an NLG application that we describe in detail in Marciniak & Strube (2004). Our classification-based approach to language generation assumes that the different types of linguistic decisions involved in the generation process can be represented in a uniform way as classification problems. The linguistic knowledge required to solve the respective classifications is then learned from a corpus annotated with both semantic and grammatical information. We have applied this framework to generating natural language route directions, e.g.: (a) Standing in front of the hotel (b) follow Meridian street south for about 100 meters, (c) passing the First Union Bank entrance on your right, (d) until you see the river side in front of you.</Paragraph>
<Paragraph position="1"> We analyze the content of such texts in terms of temporally related situations, i.e. actions (b), states (a) and events (c, d), denoted by individual discourse units. The semantics of each discourse unit is further given by a set of attributes specifying the semantic frame and aspectual category of the profiled situation. Our corpus of semantically annotated route directions comprises 75 texts with a total number of 904 discourse units (see Marciniak & Strube (2005)). The grammatical form of the texts is modeled in terms of LTAG trees, also represented as feature vectors whose individual features denote syntactic and lexical elements at both the discourse and clause levels. The generation of each discourse unit consists in assigning values to the respective features, from which the LTAG trees are then assembled. In Marciniak & Strube (2004) we implemented the generation process sequentially as a cascade of classifiers that realized incrementally the vector representation of the generated text's form, given the meaning vector as input. The classifiers handled the following eight tasks, all derived from the LTAG-based representation of the grammatical form:</Paragraph>
<Paragraph position="2"> T1: Discourse Units Rank is concerned with ordering discourse units at the local level, i.e. only clauses temporally related to the same parent clause are considered. This task is further split into a series of binary precedence classifications that determine the relative position of two discourse units at a time.
T2: Discourse Unit Position specifies the position of the child discourse unit relative to the parent one (e.g. (a) left of (b), (c) right of (b), etc.).
T3: Discourse Connective determines the lexical form of the discourse connective (e.g. null in (a), until in (d)).
T4: S Expansion specifies whether a given discourse unit is realized as a clause with an explicit subject (i.e. np+vp expansion of the root S node in a clause) (e.g. (d)) or not (e.g. (a), (b)).
T5: Verb Form determines the form of the main verb in a clause (e.g. gerund in (a), (c), bare infinitive in (b), finite present in (d)).
T6: Verb Lexicalization provides the lexical form of the main verb (e.g. stand, follow, pass, etc.).
T7: Phrase Type determines for each verb argument in a clause its syntactic realization as a noun phrase, prepositional phrase or particle.
T8: Phrase Rank determines the ordering of verb arguments within a clause; as in T1, this task is split into a number of binary classifications.</Paragraph>
<Paragraph position="3"> To apply the ILP model to the generation problem discussed above, we first determined which pairs of tasks are correlated; a sketch of this step follows.</Paragraph>
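<Paragraph position="4"> The pair selection can be sketched as running the significance test from Section 3.1 over all task pairs; this reuses the contingency_coefficient helper from above, and the gold dictionary (mapping each task name to its per-instance gold labels) is a placeholder:
```python
# Find significantly correlated task pairs among T1..T8, which then
# receive compound variables in the ILP. gold[t] lists task t's gold
# labels over all instances (placeholder data in a real setting).
from itertools import combinations
from collections import Counter

def correlated_pairs(gold, alpha=0.01):
    pairs = []
    for t1, t2 in combinations(sorted(gold), 2):
        counts = Counter(zip(gold[t1], gold[t2]))   # co-occurrence counts
        rows = sorted({a for a, _ in counts})
        cols = sorted({b for _, b in counts})
        table = [[counts[a, b] for b in cols] for a in rows]
        c, p = contingency_coefficient(table)       # see Section 3.1
        if p < alpha:
            pairs.append((t1, t2, round(c, 2)))
    return pairs
```
</Paragraph>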
<Paragraph position="5"> The obtained network (Figure 3) is consistent with traditional analyses of linguistic structure in terms of adjacent but separate levels: discourse, clause and phrase. Only a few correlations extend over level boundaries, while the tasks within each level are correlated. As an example consider three interrelated tasks, Connective, S Expansion and Verb Form, and their different realizations presented in Table 1. A different realization of any of these tasks can evidently affect the overall meaning of a discourse unit or its stylistics, and only certain combinations of the different forms are allowed in the given semantic context. We can conclude that for such groups of tasks sequential processing may fail to deliver an optimal assignment.</Paragraph>
</Section> </Paper>