File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1115_metho.xml
Size: 21,516 bytes
Last Modified: 2025-10-06 14:10:22
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1115"> <Title>Using String-Kernels for Learning Semantic Parsers</Title>
<Section position="4" start_page="913" end_page="913" type="metho"> <SectionTitle> CLANG MR. </SectionTitle>
<Paragraph position="0"> ably to other existing systems and is particularly robust to noise.</Paragraph> </Section>
<Section position="5" start_page="913" end_page="915" type="metho"> <SectionTitle> 2 Semantic Parsing </SectionTitle>
<Paragraph position="0"> We call the process of mapping natural language (NL) utterances into their computer-executable meaning representations (MRs) semantic parsing. These MRs are expressed in formal languages which we call meaning representation languages (MRLs). We assume that all MRLs have deterministic context-free grammars, which is true for almost all computer languages. This ensures that every MR will have a unique parse tree. A learning system for semantic parsing is given a training corpus of NL sentences paired with their respective MRs, from which it has to induce a semantic parser which can map novel NL sentences to their correct MRs.</Paragraph>
<Paragraph position="1"> Figure 1 shows an example of an NL sentence and its MR from the CLANG domain. CLANG (Chen et al., 2003) is the standard formal coach language in which coaching advice is given to soccer agents which compete on a simulated soccer field in the RoboCup Coach Competition. In the MR of the example, bpos stands for &quot;ball position&quot;. The second domain we have considered is the GEOQUERY domain (Zelle and Mooney, 1996), which is a query language for a small database of about 800 U.S. geographical facts. Figure 2 shows an NL query and its MR form in a functional query language. The parse of the functional query language is also shown along with the involved productions. This example is also used later to illustrate how our system does semantic parsing. The MR in the functional query language can be read as if processing a list which gets modified by various functions. From the innermost expression going outwards it means: the state of Texas, the list containing all the states next to the state of Texas, and the list of all the rivers which flow through these states. This list is finally returned as the answer. KRISP does semantic parsing using the notion of a semantic derivation of an NL sentence. In the following subsections, we define the semantic derivation of an NL sentence and its probability. The task of semantic parsing then is to find the most probable semantic derivation of an NL sentence. In section 3, we describe how KRISP learns the string classifiers that are used to obtain the probabilities needed in finding the most probable semantic derivation.</Paragraph>
<Section position="1" start_page="913" end_page="914" type="sub_section"> <SectionTitle> 2.1 Semantic Derivation </SectionTitle>
<Paragraph position="0"> We define a semantic derivation, D, of an NL sentence, s, as a parse tree of an MR (not necessarily the correct MR) such that each node of the parse tree also contains a substring of the sentence in addition to a production. We denote nodes of the derivation tree by tuples (π, [i..j]), where π is the node's production and [i..j] stands for the substring s[i..j] of s (i.e. the substring from the ith word to the jth word), and we say that the node or its production covers the substring s[i..j]. The substrings covered by the children of a node are not allowed to overlap, and the substring covered by the parent must be the concatenation of the substrings covered by its children. Figure 3 shows a semantic derivation of the NL sentence and the MR parse which were shown in figure 2. The words are numbered according to their position in the sentence. Instead of non-terminals, productions are shown in the nodes to emphasize the role of productions in semantic derivations.</Paragraph>
<Paragraph position="1"> Sometimes, the children of an MR parse tree node may not be in the same order as the substrings of the sentence they should cover in a semantic derivation. For example, if the sentence was &quot;Through the states that border Texas which rivers run?&quot;, which has the same MR as the sentence in figure 3, then the order of the children of the node &quot;RIVER → TRAVERSE(STATE)&quot; would need to be reversed. To accommodate this, a semantic derivation tree is allowed to contain MR parse tree nodes in which the children have been permuted.</Paragraph>
<Paragraph position="4"> Note that given a semantic derivation of an NL sentence, it is trivial to obtain the corresponding MR simply as the string generated by the parse. Since children nodes may be permuted, this step also needs to permute them back to the order given by the MRL productions. If a semantic derivation gives the correct MR of the NL sentence, then we call it a correct semantic derivation; otherwise it is an incorrect semantic derivation.</Paragraph> </Section>
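As a rough illustration of this representation (the data layout and names below are ours, not the paper's), a derivation node can hold its production, the word span it covers, and its child subtrees kept in the order given by the MRL production; reading the MR off such a tree is then a simple tree walk:

    # Illustrative sketch only: one way to store a semantic-derivation node.
    # Children are kept in MRL-production order, so their spans may be permuted
    # relative to the sentence, as described above.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class DerivationNode:
        lhs: str                      # e.g. "STATE"
        rhs: List[str]                # e.g. ["NEXT_TO", "(", "STATE", ")"]
        span: Tuple[int, int]         # (i, j): node covers substring s[i..j]
        children: List["DerivationNode"] = field(default_factory=list)

    def mr_string(node: DerivationNode, nonterminals: set) -> str:
        """Read the MR off a derivation: expand RHS symbols left to right,
        replacing each non-terminal by the MR of the corresponding child."""
        parts, kids = [], iter(node.children)
        for symbol in node.rhs:
            if symbol in nonterminals:
                parts.append(mr_string(next(kids), nonterminals))
            else:
                parts.append(symbol)          # terminal token, copied as-is
        return " ".join(parts)                # tokenized MR, joined with spaces

For the derivation in figure 3, this walk recovers (a tokenized form of) the correct MR, answer(traverse(next_to(stateid(texas)))).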
<Section position="2" start_page="914" end_page="915" type="sub_section"> <SectionTitle> 2.2 Most Probable Semantic Derivation </SectionTitle>
<Paragraph position="0"> Let Pπ(u) denote the probability that a production π of the MRL grammar covers the NL substring u. In other words, the NL substring u expresses the semantic concept of a production π with probability Pπ(u). In the next subsection we will describe how KRISP obtains these probabilities using string-kernel based SVM classifiers. Assuming these probabilities are independent of each other, the probability of a semantic derivation D of a sentence s is: P(D) = ∏_{(π,[i..j]) ∈ D} Pπ(s[i..j]).</Paragraph>
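As a minimal sketch of this product (our notation, not the paper's: a node is a (production, (i, j), children) triple, and prob(production, substring) stands for the classifier estimate Pπ(u)):

    # Probability of a derivation = product of P_pi(s[i..j]) over its nodes.
    def derivation_probability(node, sentence, prob):
        production, (i, j), children = node
        covered = " ".join(sentence[i - 1:j])     # s[i..j], 1-based and inclusive
        p = prob(production, covered)
        for child in children:
            p *= derivation_probability(child, sentence, prob)
        return p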
<Paragraph position="2"> The task of the semantic parser is to find the most probable derivation of a sentence s. This task can be recursively performed using the notion of a partial derivation E_{n,s[i..j]}, which stands for a subtree of a semantic derivation tree with n as the left-hand-side (LHS) non-terminal of the root production and which covers s from index i to j. For example, the subtree rooted at the node &quot;(STATE → NEXT_TO(STATE), [5..9])&quot; in the derivation shown in figure 3 is a partial derivation which would be denoted as E_{STATE,s[5..9]}.</Paragraph>
<Paragraph position="3"> Note that the derivation D of sentence s is then simply E_{start,s[1..|s|]}, where start is the start symbol of the MRL's context-free grammar, G.</Paragraph>
<Paragraph position="4"> Our procedure to find the most probable partial derivation E*_{n,s[i..j]} considers all possible subtrees whose root production has n as its LHS non-terminal and which cover s from index i to j.</Paragraph>
<Paragraph position="5"> Mathematically, the most probable partial derivation E*_{n,s[i..j]} is recursively defined as: E*_{n,s[i..j]} = makeTree(argmax_{π = n → n1..nt ∈ G, (p1,..,pt) ∈ partition(s[i..j],t)} Pπ(s[i..j]) * ∏_{k=1..t} P(E*_{nk,pk}))</Paragraph>
<Paragraph position="7"> where partition(s[i..j],t) is a function which returns the set of all partitions of s[i..j] with t elements, including their permutations. A partition of a substring s[i..j] with t elements is a t-tuple containing t non-overlapping substrings of s[i..j] which give s[i..j] when concatenated.</Paragraph>
<Paragraph position="8"> For example, (&quot;the states bordering&quot;, &quot;Texas ?&quot;) is a partition of the substring &quot;the states bordering Texas ?&quot; with 2 elements. The procedure makeTree(π, (p1,..,pt)) constructs a partial derivation tree by making π its root production and making the most probable partial derivation trees found through the recursion its child subtrees, which cover the substrings according to the partition (p1,..,pt).</Paragraph>
<Paragraph position="9"> The most probable partial derivation E*_{n,s[i..j]} is found using the above equation by trying all productions π = n → n1..nt in G which have n as the LHS, and all partitions with t elements of the substring s[i..j] (n1 to nt are the right-hand-side (RHS) non-terminals of π; terminals do not play any role in this process and are not shown for simplicity). The most probable partial derivation E*_{STATE,s[5..9]} for the sentence shown in figure 3 will be found by trying all the productions in the grammar with STATE as the LHS, for example, one of them being &quot;STATE → NEXT_TO STATE&quot;. Then for this sample production, all partitions, (p1,p2), of the substring s[5..9] with two elements will be considered, and the most probable derivations E*_{NEXT_TO,p1} and E*_{STATE,p2} will be found recursively. The recursion reaches base cases when the productions which have n on the LHS do not have any non-terminal on the RHS, or when the substring s[i..j] becomes smaller than the length t.</Paragraph>
<Paragraph position="10"> According to the equation, a production π ∈ G and a partition (p1,..,pt) ∈ partition(s[i..j],t) will be selected in constructing the most probable partial derivation. These will be the ones which maximize the product of the probability of π covering the substring s[i..j] with the product of the probabilities of all the recursively found most probable partial derivations consistent with the partition (p1,..,pt).</Paragraph>
<Paragraph position="11"> A naive implementation of the above recursion is computationally expensive, but by suitably extending the well-known Earley context-free parsing algorithm (Earley, 1970), it can be implemented efficiently. The above task has some resemblance to probabilistic context-free grammar (PCFG) parsing, for which efficient algorithms are available (Stolcke, 1995), but we note that our task of finding the most probable semantic derivation differs from PCFG parsing in two important ways. First, the probability of a production is not independent of the sentence but depends on which substring of the sentence it covers, and second, the leaves of the tree are not individual terminals of the grammar but are substrings of words of the NL sentence. The extensions needed for Earley's algorithm are straightforward and are described in detail in (Kate, 2005), but due to space limitations we do not describe them here. Our extended Earley's algorithm does a beam search and attempts to find the ω (a parameter) most probable semantic derivations of an NL sentence s using the probabilities Pπ(s[i..j]). To make this search faster, it uses a threshold, θ, to prune low-probability derivation trees.</Paragraph> </Section> </Section>
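The recursion in section 2.2 can be sketched naively as follows; this is only an illustration of the equation (the grammar mapping, prob interface, and helper names are our assumptions), not the extended Earley algorithm with beam search that KRISP actually uses, and it has no memoization or cycle handling:

    # grammar[n] = list of (production_name, rhs_nonterminals) with n as the LHS;
    # prob(production_name, substring) stands for the classifier estimate P_pi(u).
    from itertools import permutations

    def splits(i, j, t):
        """All ways to cut the word span [i..j] into t contiguous, non-empty pieces."""
        if t == 1:
            yield [(i, j)]
            return
        for k in range(i, j):                        # first piece is [i..k]
            for rest in splits(k + 1, j, t - 1):
                yield [(i, k)] + rest

    def best_partial(n, i, j, sentence, grammar, prob):
        """Most probable partial derivation E*_{n, s[i..j]} as a (probability, tree)
        pair; a tree is (production_name, (i, j), children), spans 1-based inclusive."""
        covered = " ".join(sentence[i - 1:j])
        best_p, best_tree = 0.0, None
        for name, rhs in grammar[n]:                 # all productions with n as LHS
            if not rhs:                              # base case: no RHS non-terminals
                p = prob(name, covered)
                if p > best_p:
                    best_p, best_tree = p, (name, (i, j), [])
                continue
            t = len(rhs)
            if j - i + 1 < t:                        # span too short to split t ways
                continue
            for pieces in splits(i, j, t):
                for assigned in permutations(pieces):    # children may be permuted
                    p, kids = prob(name, covered), []
                    for child_nt, (ci, cj) in zip(rhs, assigned):
                        cp, ctree = best_partial(child_nt, ci, cj,
                                                 sentence, grammar, prob)
                        p *= cp
                        kids.append(ctree)
                    if p > best_p:                   # makeTree: pi as root production
                        best_p, best_tree = p, (name, (i, j), kids)
        return best_p, best_tree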
<Section position="6" start_page="915" end_page="917" type="metho"> <SectionTitle> 3 KRISP's Training Algorithm </SectionTitle>
<Paragraph position="0"> In this section, we describe how KRISP learns the classifiers which give the probabilities Pπ(u) needed for semantic parsing as described in the previous section. Given the training corpus of NL sentences paired with their MRs, {(si,mi)|i = 1..N}, KRISP first parses the MRs using the MRL grammar, G. We represent the parse of MR mi by parse(mi).</Paragraph>
<Paragraph position="1"> Figure 4 shows pseudo-code for KRISP's training algorithm. KRISP learns a semantic parser iteratively, each iteration improving upon the parser learned in the previous iteration. In each iteration, for every production π of G, KRISP collects positive and negative example sets. In the first iteration, the set P(π) of positive examples for production π contains all sentences si such that parse(mi) uses the production π. The set of negative examples, N(π), for production π includes all of the remaining training sentences. Using these positive and negative examples, an SVM classifier, Cπ, is trained for each production π using a normalized string subsequence kernel. Following the framework of Lodhi et al. (2002), we define a kernel between two strings as the number of common subsequences they share. One difference, however, is that their strings are over characters while our strings are over words. The more the two strings share, the greater the similarity score will be.</Paragraph>
<Paragraph position="2"> Normally, SVM classifiers only predict the class of the test example, but one can obtain class probability estimates by mapping the distance of the example from the SVM's separating hyperplane to the range [0,1] using a learned sigmoid function (Platt, 1999). The classifier Cπ then gives us the probabilities Pπ(u). We represent the set of these classifiers by C = {Cπ | π ∈ G}.</Paragraph>
<Paragraph position="3"> Next, using these classifiers, the extended Earley's algorithm, which we call EXTENDED_EARLEY in the pseudo-code, is invoked to obtain the ω best semantic derivations for each of the training sentences. The procedure getMR returns the MR for a semantic derivation. At this point, for many training sentences, the resulting most-probable semantic derivation may not give the correct MR. Hence, next, the system collects more refined positive and negative examples to improve the result in the next iteration.</Paragraph>
<Paragraph position="4"> Figure 4 (pseudo-code fragment):
function TRAIN_KRISP(training corpus {(si,mi)|i = 1..N}, MRL grammar G)
   for each π ∈ G // collect positive and negative examples for the first iteration
      for i = 1 to N do
         if π is used in parse(mi) then include si in P(π)
         else include si in N(π)
   for iteration = 1 to MAX_ITR do
      for each π ∈ G do
         ...</Paragraph>
<Paragraph position="5"> It is possible that none of the obtained ω derivations give the correct MR. But as will be described shortly, the most probable derivation which gives the correct MR is needed to collect positive and negative examples for the next iteration. Hence, in these cases, a version of the extended Earley's algorithm, EXTENDED_EARLEY_CORRECT, is invoked which also takes the correct MR as an argument and returns the best ω derivations it finds, all of which give the correct MR. This is easily done by making sure all subtrees derived in the process are present in the parse of the correct MR.</Paragraph>
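For concreteness, the word-level kernel described above can be sketched as follows; it simply counts shared (possibly non-contiguous) word subsequences and normalizes, leaving out the gap-based downweighting of Lodhi et al. (2002), so it is an illustration rather than KRISP's exact kernel:

    def common_subsequences(s, t):
        """Number of matching pairs of word subsequences of s and t (excluding empty).
        f[i][j] counts matching index-subsequence pairs of s[:i] and t[:j]."""
        n, m = len(s), len(t)
        f = [[1] * (m + 1) for _ in range(n + 1)]    # 1 accounts for the empty pair
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                f[i][j] = f[i - 1][j] + f[i][j - 1] - f[i - 1][j - 1]
                if s[i - 1] == t[j - 1]:
                    f[i][j] += f[i - 1][j - 1]
        return f[n][m] - 1

    def normalized_kernel(s, t):
        """K(s,t) / sqrt(K(s,s) * K(t,t)), so identical word sequences score 1.0."""
        k = common_subsequences(s, t)
        if k == 0:
            return 0.0
        return k / (common_subsequences(s, s) * common_subsequences(t, t)) ** 0.5

    # e.g. normalized_kernel("the states bordering Texas".split(),
    #                        "states that border Texas".split())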
<Paragraph position="6"> From these derivations, positive and negative examples are collected for the next iteration. Positive examples are collected from the most probable derivation which gives the correct MR; figure 3 showed an example of a derivation which gives the correct MR. At each node in such a derivation, the substring covered is taken as a positive example for its production. Negative examples are collected from those derivations whose probability is higher than the most probable correct derivation but which do not give the correct MR. Figure 5 shows an example of an incorrect derivation. Here the function &quot;next to&quot; is missing from the MR it produces. The following procedure is used to collect negative examples from incorrect derivations. The incorrect derivation and the most probable correct derivation are traversed simultaneously starting from the root using breadth-first traversal. The first nodes at which their productions differ are detected, and all of the words covered by these nodes (in both derivations) are marked. In the correct and incorrect derivations shown in figures 3 and 5 respectively, the first nodes where the productions differ are &quot;(STATE → NEXT_TO(STATE), [5..9])&quot; and &quot;(STATE → STATEID, [8..9])&quot;. Hence, the union of the words covered by them, words 5 to 9 (&quot;the states bordering Texas?&quot;), will be marked. For each of these marked words, the procedure considers all of the productions which cover it in the two derivations. The nodes of the productions which cover a marked word in the incorrect derivation but not in the correct derivation are used to collect negative examples. In the example, the node &quot;(TRAVERSE → traverse, [1..7])&quot; will be used to collect a negative example (i.e. words 1 to 7, &quot;which rivers run through the states bordering&quot;, will be a negative example for the production TRAVERSE → traverse) because the production covers the marked words &quot;the&quot;, &quot;states&quot; and &quot;bordering&quot; in the incorrect derivation but not in the correct derivation. With this as a negative example, hopefully in the next iteration, the probability of this derivation will decrease significantly and drop below the probability of the correct derivation.</Paragraph>
<Paragraph position="7"> In each iteration, the positive examples from the previous iteration are first removed so that new positive examples which lead to better correct derivations can take their place. However, negative examples are accumulated across iterations for better accuracy, because negative examples from each iteration come only from incorrect derivations and it is always good to include them. To further increase the number of negative examples, every positive example for a production is also included as a negative example for all the other productions having the same LHS.</Paragraph>
<Paragraph position="8"> After a specified number of iterations, MAX_ITR, the trained classifiers from the last iteration are returned.</Paragraph>
<Paragraph position="10"> Figure 5: an incorrect semantic derivation of &quot;Which rivers run through the states bordering Texas?&quot; which gives the incorrect MR answer(traverse(stateid(texas))).</Paragraph>
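A rough sketch of this negative-example collection, under the same illustrative node layout as above (a (production, (i, j), children) triple with 1-based spans); details such as how parallel children are matched are assumptions of ours:

    from collections import deque

    def all_nodes(root):
        """All nodes of a derivation, where a node is (production, (i, j), children)."""
        stack, out = [root], []
        while stack:
            node = stack.pop()
            out.append(node)
            stack.extend(node[2])
        return out

    def covering_productions(root, word):
        """Productions whose nodes cover the given word index in this derivation."""
        return {prod for prod, (i, j), _ in all_nodes(root) if i <= word <= j}

    def first_disagreement(correct, incorrect):
        """Simultaneous breadth-first traversal; returns the first pair of nodes
        whose productions differ (children matched positionally up to that point)."""
        queue = deque([(correct, incorrect)])
        while queue:
            c, w = queue.popleft()
            if c[0] != w[0]:
                return c, w
            queue.extend(zip(c[2], w[2]))
        return None

    def negative_examples(correct, incorrect, sentence):
        """(production, covered substring) pairs used as negative examples."""
        pair = first_disagreement(correct, incorrect)
        if pair is None:
            return []
        marked = set()
        for _, (i, j), _ in pair:                    # words covered by either node
            marked.update(range(i, j + 1))
        negatives = []
        for prod, (i, j), _ in all_nodes(incorrect):
            for w in range(i, j + 1):
                if w in marked and prod not in covering_productions(correct, w):
                    negatives.append((prod, " ".join(sentence[i - 1:j])))
                    break                            # one example per offending node
        return negatives

On the derivations of figures 3 and 5 this yields a single negative example, pairing TRAVERSE → traverse with words 1 to 7, as described above.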
<Paragraph position="11"> Testing involves using these classifiers to generate the most probable derivation of a test sentence, as described in subsection 2.2, and returning its MR.</Paragraph>
<Paragraph position="12"> The MRL grammar may contain productions corresponding to constants of the domain, e.g., state names like &quot;STATEID → 'texas'&quot; or river names like &quot;RIVERID → 'colorado'&quot;. Our system allows the user to specify such productions as constant productions, giving the NL substrings, called constant substrings, which directly relate to them. For example, the user may give &quot;Texas&quot; as the constant substring for the production &quot;STATEID → 'texas'&quot;. Then KRISP does not learn classifiers for these constant productions and instead decides whether they cover a substring of the sentence by matching it against the provided constant substrings.</Paragraph> </Section>
<Section position="7" start_page="917" end_page="918" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="917" end_page="918" type="sub_section"> <SectionTitle> 4.1 Methodology </SectionTitle>
<Paragraph position="0"> KRISP was evaluated on the CLANG and GEOQUERY domains described in section 2. The CLANG corpus was built by randomly selecting 300 pieces of coaching advice from the log files of the 2003 RoboCup Coach Competition. These formal advice instructions were manually translated into English (Kate et al., 2005). The GEOQUERY corpus contains 880 English queries collected from undergraduates and from real users of a web-based interface (Tang and Mooney, 2001). These were manually translated into their MRs. The average length of an NL sentence in the CLANG corpus is 22.52 words while in the GEOQUERY corpus it is 7.48 words, which indicates that CLANG is the harder corpus. The average length of the MRs is 13.42 tokens in the CLANG corpus while it is 6.46 tokens in the GEOQUERY corpus.</Paragraph>
<Paragraph position="1"> KRISP was evaluated using standard 10-fold cross validation. For every test sentence, only the best MR corresponding to the most probable semantic derivation is considered for evaluation, and its probability is taken as the system's confidence in that MR. Since KRISP uses a threshold, θ, to prune low-probability derivation trees, it sometimes may fail to return any MR for a test sentence. We computed the number of test sentences for which KRISP produced MRs, and the number of these MRs that were correct. For CLANG, an output MR is considered correct if and only if it exactly matches the correct MR. For GEOQUERY, an output MR is considered correct if and only if the resulting query retrieves the same answer as the correct MR when submitted to the database.</Paragraph>
<Paragraph position="2"> Performance was measured in terms of precision (the percentage of generated MRs that were correct) and recall (the percentage of all sentences for which correct MRs were obtained).</Paragraph>
<Paragraph position="3"> In our experiments, the threshold θ was fixed to 0.05 and the beam size ω was 20. These parameters were found through pilot experiments.</Paragraph>
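As a small, concrete reading of the precision and recall measures defined above (over hypothetical per-sentence result records, not the paper's evaluation code):

    def precision_recall(results, min_confidence=0.0):
        """results: list of (confidence, correct) where correct is True, False, or
        None; None means the system returned no MR for that sentence."""
        produced = [(c, ok) for c, ok in results
                    if ok is not None and c >= min_confidence]
        num_correct = sum(1 for _, ok in produced if ok)
        precision = num_correct / len(produced) if produced else 0.0
        recall = num_correct / len(results) if results else 0.0
        return precision, recall

    # Sweeping min_confidence over the observed confidences traces the
    # precision-recall curves referred to below.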
<Paragraph position="4"> The maximum number of iterations (MAX_ITR) required was only 3; beyond this, we found that the system only overfits the training corpus.</Paragraph>
<Paragraph position="5"> We compared our system's performance with the following existing systems: the string and tree versions of SILT (Kate et al., 2005), a system that learns transformation rules relating NL phrases to MRL expressions; WASP (Wong and Mooney, 2006), a system that learns transformation rules using statistical machine translation techniques; SCISSOR (Ge and Mooney, 2005), a system that learns an integrated syntactic-semantic parser; and CHILL (Tang and Mooney, 2001), an ILP-based semantic parser. We also compared with the CCG-based semantic parser by Zettlemoyer et al. (2005), but their results are available only for the GEO880 corpus and their experimental set-up is also different from ours. Like KRISP, WASP and SCISSOR also give confidences to the MRs they generate, which are used to plot precision-recall curves by measuring precisions and recalls at various confidence levels. The results of the other systems are shown as points on the precision-recall graph.</Paragraph> </Section> </Section> </Paper>