<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1197"> <Title>Semantic Role Labeling via Integer Linear Programming Inference</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Task Description </SectionTitle> <Paragraph position="0"> The goal of the semantic-role labeling task is to discover the verb-argument structure of a given input sentence. For example, given the sentence &quot;I left my pearls to my daughter-in-law in my will&quot;, the goal is to identify the different arguments of the verb left, which yields the output: [A0 I][V left ][A1 my pearls][A2 to my daughter-in-law] [AM-LOC in my will].</Paragraph> <Paragraph position="1"> Here A0 represents the leaver, A1 represents the thing left, A2 represents the benefactor, AM-LOC is an adjunct indicating the location of the action, and V determines the verb.</Paragraph> <Paragraph position="2"> Following the definitions of PropBank and the CoNLL-2004 shared task, there are six different types of arguments, labelled A0-A5 and AA.</Paragraph> <Paragraph position="3"> These labels have different semantics for each verb, as specified in the PropBank Frame files. In addition, there are also 13 types of adjuncts, labelled AM-XXX, where XXX specifies the adjunct type.</Paragraph> <Paragraph position="4"> In some cases, an argument may span different parts of a sentence; the label C-XXX is used to specify the continuity of the arguments, as shown in the example below.</Paragraph> <Paragraph position="5"> [A1 The pearls] , [A0 I] [V said] , [C-A1 were left to my daughter-in-law].</Paragraph> <Paragraph position="6"> Moreover, in some cases, an argument might be a relative pronoun that in fact refers to the actual agent outside the clause. 
In this case, the actual agent is labeled with the appropriate argument type, XXX, while the relative pronoun is instead labeled R-XXX.</Paragraph> <Paragraph position="7"> For example, [A1 The pearls] [R-A1 which] [A0 I] [V left] , [A2 to my daughter-in-law] are fake.</Paragraph> <Paragraph position="8"> See Kingsbury and Palmer (2002) and Carreras and Màrquez (2003) for the details of the definition.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 System Architecture </SectionTitle> <Paragraph position="0"> Our semantic role labeling system consists of two phases. The first phase finds a subset of arguments from all possible candidates. The goal here is to filter out as many false argument candidates as possible while still maintaining high recall. The second phase focuses on identifying the types of those argument candidates. Since the number of candidates is much smaller, the second phase can afford slightly more complicated features to facilitate learning a better classifier. This section first introduces the learning system we use and then describes how we learn the classifiers in these two phases.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 SNoW Learning Architecture </SectionTitle> <Paragraph position="0"> The learning algorithm used is a variation of the Winnow update rule incorporated in SNoW (Roth, 1998; Roth and Yih, 2002), a multi-class classifier that is specifically tailored for large-scale learning tasks. SNoW learns a sparse network of linear functions, in which the targets (argument border predictions or argument type predictions, in this case) are represented as linear functions over a common feature space. 
It incorporates several improvements over the basic Winnow multiplicative update rule.</Paragraph> <Paragraph position="1"> In particular, a regularization term is added, which has the effect of trying to separate the data with a thick separator (Grove and Roth, 2001; Hang et al., 2002). In the work presented here we use this regularization with a fixed parameter.</Paragraph> <Paragraph position="2"> Experimental evidence has shown that SNoW activations are monotonic with the confidence in the prediction. Therefore, they can provide a good source of probability estimation. We use softmax (Bishop, 1995) over the raw activation values to obtain conditional probabilities, which also serve as the scores of the targets. Specifically, suppose the number of classes is n, and the raw activation value of class i is act_i.</Paragraph> <Paragraph position="3"> The posterior estimation for class i is derived by the following equation.</Paragraph> <Paragraph position="4"> score(i) = e^{act_i} / Σ_{j=1..n} e^{act_j}</Paragraph> <Paragraph position="5"> The score plays an important role in different places. For example, the first phase uses the scores to decide which argument candidates should be filtered out. Also, the scores output by the second-phase classifier are used in the inference procedure to reason for the best global labeling.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 First Phase: Find Argument Candidates </SectionTitle> <Paragraph position="0"> The first phase is to predict the argument candidates of a given sentence that correspond to the active verb. Unfortunately, it turns out to be difficult to predict the exact arguments accurately. Therefore, the goal here is to output a superset of the correct arguments by filtering out unlikely candidates.</Paragraph> <Paragraph position="1"> Specifically, we learn two classifiers, one to detect beginning argument locations and the other to detect end argument locations. 
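As a concrete illustration of the scoring step above, here is a minimal softmax sketch. This is our illustration, not the paper's SNoW implementation, and the function name is hypothetical.

```python
import math

def softmax_scores(activations):
    """Turn raw SNoW-style activations into conditional probabilities via
    softmax, as in the equation above.  Subtracting the maximum activation
    before exponentiating is a standard numerical-stability trick that
    leaves the result unchanged."""
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes with raw activations 2.0, 1.0, 0.5:
probs = softmax_scores([2.0, 1.0, 0.5])
# probs sum to 1, and class 0 (the highest activation) gets the highest score
```

Because softmax is monotonic in the activations, the ranking of classes is preserved, which is why the activations can double as confidence scores.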
Each multi-class classifier makes predictions over forty-three classes--thirty-two argument types, ten continuous argument types, and one class to detect not beginning/not end. The features used for these classifiers are: + Word feature includes the current word, two words before, and two words after.</Paragraph> <Paragraph position="2"> + Part-of-speech tag (POS) feature includes the POS tags of all words in a window of size two.</Paragraph> <Paragraph position="3"> + Chunk feature includes the BIO tags for chunks of all words in a window of size two.</Paragraph> <Paragraph position="4"> + Predicate lemma & POS tag show the lemma form and POS tag of the active predicate.</Paragraph> <Paragraph position="5"> + Voice feature is the voice (active/passive) of the current predicate. This is extracted with a simple rule: a verb is identified as passive if it follows a to-be verb in the same phrase chunk and its POS tag is VBN (past participle), or if it immediately follows a noun phrase.</Paragraph> <Paragraph position="6"> + Position feature describes whether the current word is before or after the predicate.</Paragraph> <Paragraph position="7"> + Chunk pattern encodes the sequence of chunks from the current word to the predicate.</Paragraph> <Paragraph position="8"> + Clause tag indicates the boundary of clauses.</Paragraph> <Paragraph position="9"> + Clause path feature is a path formed from a semi-parsed tree containing only clauses and chunks. Each clause is named with the chunk preceding it. The clause path is the path from the predicate to the target word in the semi-parse tree.</Paragraph> <Paragraph position="10"> + Clause position feature is the position of the target word relative to the predicate in the semi-parse tree containing only clauses. 
There are four configurations: target word and predicate share the same parent, the target word's parent is an ancestor of the predicate, the predicate's parent is an ancestor of the target word, or otherwise.</Paragraph> <Paragraph position="11"> Because each argument consists of a single beginning and a single ending, these classifiers can be used to construct a set of potential arguments (by combining each predicted begin with each predicted end after it of the same type).</Paragraph> <Paragraph position="12"> Although this phase identifies typed arguments (i.e. labeled with argument types), the second phase will re-score each phrase using phrase-based classifiers - therefore, the goal of the first phase is simply to identify non-typed phrase candidates. On this task, we achieve 98.96% and 88.65% recall (overall, without verb) on the training and the development set, respectively. Because these are the only candidates passed to the second phase, the final system performance is upper-bounded by 88.65%.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Second Phase: Argument Classification </SectionTitle> <Paragraph position="0"> The second phase of our system assigns the final argument classes to (a subset of) the argument candidates supplied by the first phase. Again, the SNoW learning architecture is used to train a multi-class classifier to label each argument with one of the argument types, plus a special class--no argument (null). 
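The begin/end pairing used by the first phase can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and for brevity it ignores the per-type pairing mentioned above.

```python
def candidate_spans(begin_positions, end_positions):
    """Combine begin/end classifier predictions into candidate argument
    spans: each predicted begin is paired with every predicted end at or
    after it.  Spans are inclusive word-index intervals."""
    return [(b, e) for b in begin_positions for e in end_positions if e >= b]

# Begins predicted at words 0 and 3, ends at words 2 and 5:
spans = candidate_spans([0, 3], [2, 5])
# -> [(0, 2), (0, 5), (3, 5)]
```

Note the deliberate over-generation: every consistent begin/end pair is kept, which is how the phase trades precision for the high recall reported above.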
Training examples are created from the argument candidates supplied by the first phase using the following features: + Predicate lemma & POS tag, voice, position, clause path, clause position, chunk pattern: the same features as those in the first phase.</Paragraph> <Paragraph position="1"> + Word & POS tag from the argument, including the first, last, and head1 word and tag.</Paragraph> <Paragraph position="2"> + Named entity feature tells whether the target argument is, embeds, overlaps, or is embedded in a named entity, along with its type.</Paragraph> <Paragraph position="3"> + Chunk tells whether the target argument is, embeds, overlaps, or is embedded in a chunk, along with its type.</Paragraph> <Paragraph position="4"> + Lengths of the target argument, in numbers of words and chunks separately.</Paragraph> <Paragraph position="5"> + Verb class feature is the class of the active predicate described in the PropBank Frames.</Paragraph> <Paragraph position="6"> + Phrase type uses simple heuristics to identify the target argument as VP, PP, or NP.</Paragraph> <Paragraph position="7"> + Sub-categorization describes the phrase structure around the predicate. We separate the clause containing the predicate into three parts--the predicate chunk and the segments before and after it--and use the sequence of phrase types of these three segments.</Paragraph> <Paragraph position="8"> + Baseline features identify the word &quot;not&quot; in the main verb chunk as AM-NEG and a modal verb in the main verb chunk as AM-MOD.</Paragraph> <Paragraph position="9"> + Clause coverage describes how much of the local clause (from the predicate) is covered by the target argument.</Paragraph> <Paragraph position="10"> + Chunk pattern length feature counts the number of patterns in the argument.</Paragraph> <Paragraph position="11"> + Conjunctions join every pair of the above features as new features.</Paragraph> <Paragraph position="12"> + Boundary words & POS tags include the two words/tags before and after the target argument. 
+ Bigrams are pairs of words/tags in the window from two words before the target to the first word of the target, and also from the last word to two words after the argument.</Paragraph> <Paragraph position="13"> 1 We use simple rules to first decide whether a candidate phrase type is VP, NP, or PP. The headword of an NP phrase is the right-most noun. Similarly, the left-most verb/preposition of a VP/PP phrase is extracted as the headword. + Sparse collocation picks one word/tag from the two words before the argument, the first word/tag and the last word/tag of the argument, and one word/tag from the two words after the argument to join as features.</Paragraph> <Paragraph position="14"> Although the predictions of the second-phase classifier could be used directly, the labels of arguments in a sentence often violate some constraints. Therefore, we rely on the inference procedure to make the final predictions.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Inference via ILP </SectionTitle> <Paragraph position="0"> Ideally, if the learned classifiers were perfect, arguments could be labeled correctly according to the classifiers' predictions. In reality, labels assigned to arguments in a sentence often contradict each other and violate the constraints arising from structural and linguistic information. In order to resolve the conflicts, we design an inference procedure that takes as input the confidence scores of each individual argument given by the second-phase classifier, and outputs the best global assignment that also satisfies the constraints. In this section we first introduce the constraints and the inference problem in the semantic role labeling task. 
Then, we demonstrate how we apply integer linear programming (ILP) to reason for the global label assignment.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Constraints over Argument Labeling </SectionTitle> <Paragraph position="0"> Formally, the argument classifier attempts to assign labels to a set of arguments, S_{1:M}, indexed from 1 to M. Each argument S_i can take any label from a set of argument labels, P, and the indexed set of arguments can take a set of labels, c_{1:M} ∈ P^M.</Paragraph> <Paragraph position="1"> If we assume that the classifier returns a score, score(S_i = c_i), corresponding to the likelihood of seeing label c_i for argument S_i, then, given a sentence, the unaltered inference task is solved by maximizing the overall score of the arguments,</Paragraph> <Paragraph position="2"> ĉ_{1:M} = argmax_{c_{1:M} ∈ P^M} Σ_{i=1..M} score(S_i = c_i).   (1)</Paragraph> <Paragraph position="3"> In the presence of global constraints derived from linguistic information and structural considerations, our system seeks a legitimate labeling that maximizes the score. Specifically, this can be viewed as limiting the solution space through a filter function, F, that eliminates many argument labelings from consideration. It is interesting to contrast this with previous work that filters individual phrases (see Carreras and Màrquez (2003)). Here, we are concerned with global constraints as well as constraints on the arguments. Therefore, the final labeling becomes</Paragraph> <Paragraph position="4"> ĉ_{1:M} = argmax_{c_{1:M} ∈ F(P^M)} Σ_{i=1..M} score(S_i = c_i).   (2)</Paragraph> <Paragraph position="5"> The filter function used considers the following constraints: 1. Arguments cannot cover the predicate, except those that contain only the verb or the verb and the following word.</Paragraph> <Paragraph position="6"> 2. Arguments cannot overlap with the clauses (they can be embedded in one another). 3. If a predicate is outside a clause, its arguments cannot be embedded in that clause.</Paragraph> <Paragraph position="7"> 4. No overlapping or embedding arguments. 5. No duplicate argument classes for A0-A5 and V. 
6. Exactly one V argument per verb.</Paragraph> <Paragraph position="8"> 7. If there is a C-V, then there should be a consecutive V, A1, C-V pattern. For example, when split is the verb in &quot;split it up&quot;, the A1 argument is &quot;it&quot; and the C-V argument is &quot;up&quot;.</Paragraph> <Paragraph position="9"> 8. If there is an R-XXX argument, then there has to be an XXX argument. That is, if an argument is a reference to some other argument XXX, then this referenced argument must exist in the sentence.</Paragraph> <Paragraph position="10"> 9. If there is a C-XXX argument, then there has to be an XXX argument; in addition, the C-XXX argument must occur after XXX. This is stricter than the previous rule because the order of appearance also needs to be considered.</Paragraph> <Paragraph position="11"> 10. Given the predicate, some argument classes are illegal (e.g. the predicate 'stalk' can take only A0 or A1). This linguistic information can be found in the PropBank Frames.</Paragraph> <Paragraph position="12"> We reformulate the constraints as linear (in)equalities by introducing indicator variables. The optimization problem (Eq. 2) is then solved using ILP.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Using Integer Linear Programming </SectionTitle> <Paragraph position="0"> As discussed previously, a collection of potential arguments is not necessarily a valid semantic labeling, since a valid labeling must satisfy all of the constraints. In this context, inference is the process of finding the best (according to Equation 1) valid semantic labels that satisfy all of the specified constraints. We take an approach similar to one previously used for entity/relation recognition (Roth and Yih, 2004), and model this inference procedure as solving an ILP.</Paragraph> <Paragraph position="1"> An integer linear program (ILP) is basically the same as a linear program. 
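A few of these constraints can be read directly as a validity check on a candidate labeling. The sketch below is our illustration, not the authors' implementation; it covers only constraints 5, 6, and 8, and the function name is hypothetical.

```python
def satisfies_constraints(labels):
    """Check a candidate labeling (one label per argument) against
    constraints 5, 6, and 8 from the text: no duplicate A0-A5 or V
    labels, exactly one V, and every R-XXX label accompanied by a
    plain XXX label somewhere in the sentence."""
    core = [l for l in labels if l in ("A0", "A1", "A2", "A3", "A4", "A5", "V")]
    if len(core) != len(set(core)):   # constraint 5: no duplicate core classes
        return False
    if labels.count("V") != 1:        # constraint 6: exactly one V
        return False
    for l in labels:                  # constraint 8: R-XXX requires XXX
        if l.startswith("R-") and l[2:] not in labels:
            return False
    return True

# The relative-pronoun example from Section 2 is a valid labeling:
assert satisfies_constraints(["A1", "R-A1", "A0", "V", "A2"])
# A duplicate A0 violates constraint 5:
assert not satisfies_constraints(["A0", "V", "A0"])
```

The ILP formulation developed next enforces the same conditions declaratively, during the search for the best labeling, rather than by generate-and-reject.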
The cost function and the (in)equality constraints are all linear in terms of the variables. The only difference in an ILP is that the variables can take only integer values. In our inference problem, the variables are in fact binary.</Paragraph> <Paragraph position="2"> A general binary integer programming problem can be stated as follows.</Paragraph> <Paragraph position="3"> Let p ∈ ℝ^d be a cost vector, z = (z_1, ..., z_d) a set of variables, and C_1 ∈ ℝ^{t_1 × d}, C_2 ∈ ℝ^{t_2 × d} constraint matrices, where t_1 and t_2 are the numbers of inequality and equality constraints and d is the number of binary variables. The ILP solution z* is the vector that maximizes the cost function,</Paragraph> <Paragraph position="4"> z* = argmax_{z ∈ {0,1}^d} p · z, subject to C_1 z ≤ b_1 and C_2 z = b_2,</Paragraph> <Paragraph position="5"> where b_1 ∈ ℝ^{t_1}, b_2 ∈ ℝ^{t_2}, and every z_i ∈ {0, 1}.</Paragraph> <Paragraph position="6"> To solve the problem of Equation 2 in this setting, we first reformulate the original cost function Σ_{i=1..M} score(S_i = c_i) as a linear function over several binary variables, and then represent the filter function F using linear inequalities and equalities.</Paragraph> <Paragraph position="9"> We set up a bijection from the semantic labeling to the variable set z. This is done by setting z to a set of indicator variables. Specifically, let z_{ic} = [S_i = c] be the indicator variable that represents whether or not the argument type c is assigned to S_i, and let p_{ic} = score(S_i = c). Equation 1 can then be written as the ILP cost function</Paragraph> <Paragraph position="10"> max_z Σ_{i=1..M} Σ_{c ∈ P} p_{ic} z_{ic}, subject to Σ_{c ∈ P} z_{ic} = 1 for every S_i,</Paragraph> <Paragraph position="11"> which means that each argument can take only one type. Note that this new constraint comes from the variable transformation, and is not one of the constraints used in the filter function F.</Paragraph> <Paragraph position="12"> Constraints 1 through 3 can be evaluated on a per-argument basis; for the sake of efficiency, arguments that violate these constraints are eliminated before even being given to the second-phase classifier. 
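To make the formulation concrete, the following toy brute-force solver enumerates all binary vectors and keeps the best feasible one. This is our illustration of the problem definition only, not how the paper solves it; the paper uses a commercial ILP package, and the function name is hypothetical.

```python
from itertools import product

def solve_binary_ilp(p, C1, b1, C2, b2):
    """Brute-force solver for: maximize p · z subject to C1 z ≤ b1
    (inequalities) and C2 z = b2 (equalities), with each z_i binary.
    Exponential in d, so only viable for tiny instances."""
    d = len(p)
    best, best_val = None, None
    for z in product((0, 1), repeat=d):
        rows1 = (sum(c * x for c, x in zip(row, z)) for row in C1)
        if any(r > b for r, b in zip(rows1, b1)):    # inequality violated
            continue
        rows2 = (sum(c * x for c, x in zip(row, z)) for row in C2)
        if any(r != b for r, b in zip(rows2, b2)):   # equality violated
            continue
        val = sum(pi * zi for pi, zi in zip(p, z))
        if best_val is None or val > best_val:
            best, best_val = z, val
    return best, best_val

# Three binary variables with scores 3, 2, 4; exactly one may be active
# (z1 + z2 + z3 = 1), mirroring the one-label-per-argument constraint:
z, v = solve_binary_ilp([3.0, 2.0, 4.0], [], [], [[1, 1, 1]], [1])
# z == (0, 0, 1), v == 4.0
```

The equality row plays exactly the role of the Σ_{c ∈ P} z_{ic} = 1 constraint above: among mutually exclusive indicators, the solver picks the single highest-scoring one.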
Next, we show how to transform the constraints in the filter function into linear (in)equalities over z, and use them in this ILP setting.</Paragraph> <Paragraph position="13"> Constraint 4: No overlapping or embedding If arguments S_{j_1}, ..., S_{j_k} occupy the same word in a sentence, then this constraint restricts at most one of them to be assigned an argument type. In other words, at least k − 1 of these arguments will be assigned the special class null, which means the argument candidate is not a legitimate argument. If the special class null is represented by the symbol φ, then for every set of such arguments, the following linear inequality represents this constraint.</Paragraph> <Paragraph position="14"> Σ_{i=1..k} z_{j_i φ} ≥ k − 1</Paragraph> <Paragraph position="15"> Constraint 5: No duplicate argument classes Within the same sentence, several types of arguments cannot appear more than once. For example, a predicate can take only one A0. This constraint can be represented using the following inequality.</Paragraph> <Paragraph position="16"> Σ_{i=1..M} z_{i A0} ≤ 1</Paragraph> <Paragraph position="17"> Constraint 6: Exactly one V argument For each verb, there must be exactly one V argument, which represents the active verb. Similarly, this constraint can be represented by the following equality.</Paragraph> <Paragraph position="18"> Σ_{i=1..M} z_{i V} = 1</Paragraph> <Paragraph position="19"> Constraint 7: V-A1-C-V pattern This constraint is only useful when there are three consecutive candidate arguments in a sentence. Suppose arguments S_{j_1}, S_{j_2}, S_{j_3} are consecutive. If S_{j_3} is C-V, then S_{j_1} and S_{j_2} have to be V and A1, respectively. This if-then constraint can be represented by the following two linear inequalities.</Paragraph> <Paragraph position="20"> z_{j_3 C-V} ≤ z_{j_1 V}, and z_{j_3 C-V} ≤ z_{j_2 A1} Constraint 8: R-XXX arguments Suppose the referenced argument type is A0 and the reference type is R-A0. 
The linear inequalities that represent this constraint are:</Paragraph> <Paragraph position="21"> z_{m R-A0} ≤ Σ_{i=1..M} z_{i A0}, for m = 1, ..., M</Paragraph> <Paragraph position="22"> For each reference argument pair, M inequalities of this form are needed.</Paragraph> <Paragraph position="23"> Constraint 9: C-XXX arguments This constraint is similar to the reference argument constraints. The difference is that the continued argument XXX has to occur before C-XXX. Assume that the argument pair is A0 and C-A0, and that argument S_{j_i} appears before S_{j_k} if i ≤ k. The linear inequalities that represent this constraint are:</Paragraph> <Paragraph position="24"> z_{j_m C-A0} ≤ Σ_{i=1..m−1} z_{j_i A0}, for m = 1, ..., M (the empty sum for m = 1 forces z_{j_1 C-A0} = 0)</Paragraph> <Paragraph position="25"> Constraint 10: Illegal argument types Given a specific verb, some argument types should never occur. For example, most verbs do not have an A5 argument. This constraint is represented by constraining the sum of all the corresponding indicator variables to be 0.</Paragraph> <Paragraph position="26"> Σ_{i=1..M} z_{i A5} = 0</Paragraph> <Paragraph position="27"> Using ILP to solve this inference problem enjoys several advantages. Linear constraints are very general and are able to represent many types of constraints. Previous approaches usually rely on dynamic programming to resolve non-overlapping/embedding constraints (i.e., Constraint 4) when the data is sequential, but are unable to handle other constraints. The ILP approach is flexible enough to handle constraints regardless of the structure of the data. Although solving an ILP problem is NP-hard, with the help of today's commercial numerical packages, this problem can usually be solved very quickly in practice. For instance, it takes only about 10 minutes to solve the inference problem for 4305 sentences on a Pentium-III 800 MHz machine in our experiments. Note that ordinary search methods (e.g., beam search) are not necessarily faster than solving an ILP problem and do not guarantee the optimal solution.</Paragraph> </Section> </Section> </Paper>