<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1012"> <Title>Using LTAG Based Features in Parse Reranking</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Lexicalized Tree Adjoining Grammar </SectionTitle>
<Paragraph position="0"> In this section, we give a brief introduction to Lexicalized Tree Adjoining Grammar (more details can be found in (Joshi and Schabes, 1997)). In LTAG, each word is associated with a set of elementary trees. Each elementary tree represents a possible tree structure for the word. There are two kinds of elementary trees, initial trees and auxiliary trees.</Paragraph>
<Paragraph position="1"> Elementary trees can be combined through two operations, substitution and adjunction. Substitution is used to attach an initial tree, and adjunction is used to attach an auxiliary tree. In addition to adjunction, we also use sister adjunction as defined in the LTAG statistical parser described in (Chiang, 2000).1 The tree resulting from the combination of elementary trees is called a derived tree. The tree that records the history of how a derived tree is built from the elementary trees is called a derivation tree.2 We illustrate the LTAG formalism using an example. Example 1: Pierre Vinken will join the board as a non-executive director.</Paragraph>
<Paragraph position="2"> The derived tree for Example 1 is shown in Fig. 1 (we omit the POS tags associated with each word to save space), and Fig. 2 shows the elementary trees for each word in the sentence. Fig. 3 is the derivation tree (the history of tree combinations). One of the properties of LTAG is that it factors recursion in clause structure from the statement of linguistic constraints, thus making these constraints strictly local. For example, in the derivation tree of Example 1, α1(join) and α2(Vinken) are directly connected whether there is an auxiliary tree β2(will) or not.</Paragraph>
<Paragraph position="3"> 1 Adjunction is used in the case where both the root node and the foot node appear in the Treebank tree. Sister adjunction is used in generating modifier sub-trees as sisters to the head, e.g. in base NPs. 2 Each node in the derivation tree is an elementary tree name along with the location n in the parent elementary tree where it is inserted. The location n is the Gorn tree address (see Fig. 4).</Paragraph>
<Paragraph position="4"> We will show how this property affects our redefined tree kernel later in this paper. In our experiments, we only use LTAG grammars where each elementary tree is lexicalized by exactly one word (terminal symbol) on the frontier.</Paragraph> </Section>
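To make the derivation tree in Fig. 3 concrete, the sketch below (an illustration of ours, not the authors' implementation) encodes part of it as a small data structure. The tree names α1(join), α2(Vinken), β2(will) and β4(as) are the ones mentioned in the text; the operation labels and Gorn addresses attached to the edges are placeholders assumed for illustration.

```python
# Minimal sketch of an LTAG derivation tree (illustrative only).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DerivNode:
    tree: str                                   # elementary tree name, e.g. "alpha1"
    anchor: str                                 # lexical anchor, e.g. "join"
    children: List[Tuple["DerivNode", str, str]] = field(default_factory=list)
    # each child is (node, operation, Gorn address in the parent tree)

    def attach(self, child: "DerivNode", op: str, addr: str) -> "DerivNode":
        self.children.append((child, op, addr))
        return child

# Example 1: "Pierre Vinken will join the board as a non-executive director."
join = DerivNode("alpha1", "join")
join.attach(DerivNode("alpha2", "Vinken"), "substitution", "00")  # subject NP (address assumed)
join.attach(DerivNode("beta2", "will"), "adjunction", "01")       # auxiliary verb (address assumed)
join.attach(DerivNode("beta4", "as"), "adjunction", "01")         # PP modifier (operation/address assumed)

print([(c.tree, op, addr) for c, op, addr in join.children])
```

Because β2(will) and β4(as) attach to α1(join) rather than rewiring the α1(join)–α2(Vinken) edge, adding or removing them leaves that edge intact, which is the locality property noted above.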
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Parse Reranking </SectionTitle>
<Paragraph position="0"> In recent years, reranking techniques have been successfully used in statistical parsers to rerank the output of history-based models (Black et al., 1993). In this paper, we will use the LTAG based features to improve the performance of reranking. Our motivations for using LTAG based features for reranking are the following: Unlike the generative model, it is trivial to incorporate features of various kinds in a reranking setting. Furthermore, the nature of reranking makes it possible to use global features, which allow us to combine features that are defined on arbitrary sub-trees in the parse tree and features defined on a derivation tree.</Paragraph>
<Paragraph position="1"> Several hand-crafted and arbitrary features have been exploited in the statistical parsing task, especially when parsing the WSJ Penn Treebank dataset, where performance has been finely tuned over the years. Showing a positive contribution in this task will be a convincing test for the use of LTAG based features.</Paragraph>
<Paragraph position="2"> The parse reranking dataset is well established. We use the dataset defined in (Collins, 2000).</Paragraph>
<Paragraph position="3"> In (Collins, 2000), two reranking algorithms were proposed. One was based on Markov Random Fields, and the other was based on the Boosting algorithm. In both these models, the loss functions were computed directly on the feature space. Furthermore, a rich feature set was introduced that was specifically selected by hand to target the limitations of generative models in statistical parsing.</Paragraph>
<Paragraph position="4"> In (Collins and Duffy, 2002), the Voted Perceptron algorithm was used for parse reranking. The tree kernel was used to compute the number of common sub-trees of two parse trees. The features used by this tree kernel contain all the hand selected features of (Collins, 2000). It is worth mentioning that the f-scores reported in (Collins and Duffy, 2002) are about 1% less than the results in (Collins, 2000). In (Shen and Joshi, 2003), an SVM based reranking algorithm was proposed. In that paper, the notion of preference kernels was introduced to solve the reranking problem. Two distinct kernels, the tree kernel and the linear kernel, were used with preference kernels.</Paragraph>
<Paragraph position="5"> (Fig. 4 caption) Each node in an elementary tree has a unique node address using the Gorn notation: 0 is the root with daughters 00, 01, and so on recursively, e.g. the first daughter of 01 is 010.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Using LTAG Based Features </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Motivation </SectionTitle>
<Paragraph position="0"> While the tree kernel is an easy way to compute similarity between two parse trees, it takes too many linguistically meaningless sub-trees into consideration.</Paragraph>
<Paragraph position="1"> Let us consider the example sentence in Example 1. The parse tree, or derived tree, for this sentence is shown in Fig. 1. Fig. 5 shows one of the linguistically meaningless sub-trees. The number of meaningless sub-trees is a misleading measure for discriminating good parse trees from bad ones. Furthermore, the number of meaningless sub-trees is far greater than the number of useful sub-trees. This limits both efficiency and accuracy on the test data. The use of unwanted sub-trees greatly increases the hypothesis space of a learning machine, and thus decreases the expected accuracy on test data.</Paragraph>
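To give a rough sense of scale, the sketch below (our illustration, not taken from the paper) counts sub-trees in the Collins-Duffy sense, i.e. connected fragments in which every included node contributes either all or none of its children, for a toy bracketing of part of Example 1. The bracketing itself is only indicative.

```python
# Counting Collins-Duffy style sub-trees of a derived tree (illustrative only).

def count_rooted(tree):
    """Number of sub-trees rooted at this node; leaf words contribute none."""
    if isinstance(tree, str):                  # a word
        return 0
    _, children = tree
    n = 1
    for child in children:
        n *= 1 + count_rooted(child)
    return n

def count_all(tree):
    """Total number of sub-trees anywhere in the tree."""
    if isinstance(tree, str):
        return 0
    _, children = tree
    return count_rooted(tree) + sum(count_all(c) for c in children)

# A toy bracketing of "the board as a non-executive director"
np = ("NP", [("NP", [("D", ["the"]), ("N", ["board"])]),
             ("PP", [("P", ["as"]),
                     ("NP", [("D", ["a"]),
                             ("ADJ", ["non-executive"]),
                             ("N", ["director"])])])])

print(count_all(np))   # well over a hundred fragments for this short phrase
```

Only a handful of these fragments correspond to useful syntactic generalizations; the rest are of the meaningless kind shown in Fig. 5.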
<Paragraph position="2"> In this work, we consider the hypothesis that linguistically meaningful sub-trees reveal correlations of interest and therefore are useful in stochastic models.</Paragraph>
<Paragraph position="3"> We notice that each sub-tree of a derivation tree is linguistically meaningful because it represents a valid sub-derivation. We claim that derivation trees provide a more accurate measure of similarity between two parses. This is one of the motivations for applying tree kernels to derivation trees. Note that the use of features on derivation trees is different from the use of features on dependency graphs: derivation trees include many complex patterns of tree names and attachment sites, and can represent word-to-word dependencies that are not possible in traditional dependency graphs.</Paragraph>
<Paragraph position="4"> For example, the derivation trees for Example 1 with and without optional modifiers such as β4(as) are minimally different. In contrast, in derived (parse) trees, there is an extra VP node which changes quite drastically the set of sub-trees with and without the PP modifier. In addition, using only sub-trees from the derived tree, we cannot represent a common sub-tree that contains only the words Vinken and join, since this would lead to a discontinuous sub-tree. However, LTAG based features can represent such cases trivially.</Paragraph>
<Paragraph position="5"> The comparison between (Collins, 2000) and (Collins and Duffy, 2002) in Section 3 shows that it is hard to add new features to improve performance. Our hypothesis is that the LTAG based features provide a novel set of abstract features that complement the hand selected features from (Collins, 2000), and that the LTAG based features will help improve performance in parse reranking.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Extracting Derivation Trees </SectionTitle>
<Paragraph position="0"> Before we can use LTAG based features, we need to obtain an LTAG derivation tree for each parse tree under consideration by the reranker. Our solution is to extract elementary trees and the derivation tree simultaneously from the parse trees produced by an n-best statistical parser. Our training and test data consists of n-best output from the Collins parser (see (Collins, 2000) for details on the dataset).</Paragraph>
<Paragraph position="1"> (Fig. 6 caption) ... parser. Each non-terminal is lexicalized by the parsing model. -A marks arguments recovered by the parser.</Paragraph>
<Paragraph position="2"> Since the Collins parser uses a lexicalized context-free grammar as a basis for its statistical model, we obtain parse trees of the type shown in Fig. 6. From this tree we extract elementary trees and derivation trees by recursively traversing the spine of the parse tree. The spine is the path from a non-terminal lexicalized by a word to the terminal symbol on the frontier equal to that word. Every sub-tree rooted at a non-terminal lexicalized by a different word is excised from the parse tree and recorded into the derivation tree as a substitution. Repeated non-terminals on the spine (e.g. VP(join) ... VP(join) in Fig. 6) are excised along with the sub-trees hanging off of them and recorded into the derivation tree as an adjunction. The only other case is those sub-trees rooted at non-terminals that are attached to the spine. These sub-trees are excised and recorded into the derivation tree as cases of sister adjunction.</Paragraph>
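The sketch below illustrates a single excision step of this procedure on a head-lexicalized tree in the style of Fig. 6; each excised sub-tree is then handled recursively, as described next. The tree encoding, the spine test, and the use of the parser's -A argument marks to separate substitution from sister adjunction are simplifying assumptions of ours; the actual procedure follows (Chiang, 2000).

```python
# One excision step over a head-lexicalized parse tree (simplified sketch).
# A node is (label, head_word, children); children are nodes or word strings.

def spine(node):
    """Follow the child sharing the head word down to the lexical anchor."""
    path = [node]
    _, head, children = node
    while True:
        nxt = [c for c in children if not isinstance(c, str) and c[1] == head]
        if not nxt:
            break
        path.append(nxt[0])
        _, head, children = nxt[0]
    return path

def excise(node):
    """Detach off-spine sub-trees and record how each attaches to the spine."""
    record = []
    path = spine(node)
    for parent, on_spine in zip(path, path[1:]):
        plabel, _, pchildren = parent
        for sub in pchildren:
            if isinstance(sub, str) or sub is on_spine:
                continue                          # skip bare words and the spine itself
            if plabel == on_spine[0]:             # repeated non-terminal, e.g. VP over VP
                record.append((sub[0], sub[1], "adjunction"))
            elif sub[0].endswith("-A"):           # argument mark recovered by the parser
                record.append((sub[0], sub[1], "substitution"))
            else:
                record.append((sub[0], sub[1], "sister-adjunction"))
    return record

# A fragment in this encoding (labels and structure are illustrative):
vp = ("VP", "join",
      [("VP", "join",
        [("VB", "join", ["join"]),
         ("NP-A", "board", [("DT", "the", ["the"]), ("NN", "board", ["board"])])]),
       ("PP", "as", [("IN", "as", ["as"])])])

print(excise(vp))   # [('PP', 'as', 'adjunction'), ('NP-A', 'board', 'substitution')]
```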
<Paragraph position="3"> Each sub-tree excised is recursively analyzed with this method, split up into elementary trees and then recorded into the derivation tree. The output of our algorithm for the input parse tree in Fig. 6 is shown in Fig. 2 and Fig. 3. Our algorithm is similar to the derivation tree extraction explained in (Chiang, 2000), except that we extract our LTAG from n-best sets of parse trees, while in (Chiang, 2000) the LTAG is extracted from the Penn Treebank.3 For other techniques for LTAG grammar extraction see (Xia, 2001; Chen and Vijay-Shanker, 2000).</Paragraph>
<Paragraph position="4"> 3 Also note that the path from the root node to the foot node in auxiliary trees can be greater than one (for trees with S roots).</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Using Derivation Trees </SectionTitle>
<Paragraph position="0"> In this paper, we describe two models that employ derivation trees. Model 1 uses tree kernels on derivation trees. In order to make the tree kernel more lexicalized, we extend the original definition of the tree kernel, as described below. Model 2 abstracts features from derivation trees and uses them with a linear kernel.</Paragraph>
<Paragraph position="1"> In Model 1, we combine the SVM results of the tree kernel on derivation trees with the SVM results given by a linear kernel based on features on the derived trees.</Paragraph>
<Paragraph position="2"> In Model 2, the vector space of the linear kernel consists of both LTAG based features defined on the derived trees and features defined on the derivation tree. The following LTAG features have been used in Model 2.</Paragraph>
<Paragraph position="3"> Elementary tree. Each node in the derivation tree is used as a feature.</Paragraph>
<Paragraph position="4"> Bigram of parent and its child. Each pair of parent elementary tree and child elementary tree, together with the type of operation (substitution, adjunction or sister adjunction) and the Gorn address on the parent (see Fig. 4), is used as a feature.</Paragraph>
<Paragraph position="5"> Lexicalized elementary tree. Each elementary tree associated with its lexical item is used as a feature.</Paragraph>
<Paragraph position="6"> Lexicalized bigram. As in Bigram of parent and its child, but each elementary tree is lexicalized (we use closed class words, e.g. adj, adv, prep, etc., but not noun or verb).</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Lexicalized Tree Kernel </SectionTitle>
<Paragraph position="0"> In (Collins and Duffy, 2001), the notion of a tree kernel is introduced to compute the number of common sub-trees of two parse trees. For two parse trees, p1 and p2, the tree kernel Tree(p1, p2) is defined as:</Paragraph>
<Paragraph position="1"> Tree(p1, p2) = Σ_{n1∈p1} Σ_{n2∈p2} T(n1, n2)    (1)</Paragraph>
<Paragraph position="2"> The recursive function T is defined as follows: if n1 and n2 have the same bracketing tag (e.g. S, NP, VP, ...) and the same number of children,</Paragraph>
<Paragraph position="3"> T(n1, n2) = λ ∏_i (1 + T(n1i, n2i))    (2)</Paragraph>
<Paragraph position="4"> where nki is the ith child of the node nk, λ is a weight coefficient used to control the importance of large sub-trees, and 0 < λ ≤ 1.</Paragraph>
<Paragraph position="5"> If n1 and n2 have the same bracketing tag but a different number of children, T(n1, n2) = λ. If they do not have the same bracketing tag, T(n1, n2) = 0.</Paragraph>
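A direct transcription of this recursion, under a nested (label, children) tuple encoding of our own with words as strings, is sketched below; in practice T would be memoized over node pairs, but the sketch keeps the definition literal.

```python
# Sketch of the tree kernel recursion over (label, children) tuples (illustrative only).

def T(n1, n2, lam=0.5):
    """Weighted count of common sub-trees rooted at n1 and n2 (lam is lambda)."""
    if isinstance(n1, str) or isinstance(n2, str):
        return 0.0                                   # word leaves carry no sub-trees
    (tag1, kids1), (tag2, kids2) = n1, n2
    if tag1 != tag2:
        return 0.0                                   # different bracketing tags
    if len(kids1) != len(kids2):
        return lam                                   # same tag, different number of children
    prod = lam
    for c1, c2 in zip(kids1, kids2):
        prod *= 1.0 + T(c1, c2, lam)
    return prod

def nodes(tree):
    """All non-terminal nodes of a tree."""
    if isinstance(tree, str):
        return []
    return [tree] + [m for c in tree[1] for m in nodes(c)]

def Tree(p1, p2, lam=0.5):
    """Tree(p1, p2): sum of T over all pairs of nodes from p1 and p2."""
    return sum(T(n1, n2, lam) for n1 in nodes(p1) for n2 in nodes(p2))

t1 = ("NP", [("D", ["the"]), ("N", ["board"])])
t2 = ("NP", [("D", ["a"]), ("N", ["board"])])
print(Tree(t1, t2))   # counts shared bracketing structure, weighted by lambda
```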
<Paragraph position="6"> In (Collins and Duffy, 2002), lexical items are all located at the leaf nodes of parse trees. Therefore sub-trees that do not contain any leaf node are not lexicalized. Furthermore, due to the introduction of the parameter λ, lexical information is almost ignored for sub-trees whose root node is not close to the leaf nodes, i.e. sub-trees rooted at the S node.</Paragraph>
<Paragraph position="7"> (Fig. 7 caption) ... its decomposition into a pattern ptn(n) and corresponding vector of lexical information lex(n).</Paragraph>
<Paragraph position="8"> In order to make the tree kernel more lexicalized, we associate each node with a lexical item. For example, Fig. 7 shows a lexicalized sub-tree and its decomposition into features. As shown in Fig. 7, the lexical information lex(t) extracted from the lexicalized tree consists of words from the root and its immediate children. This is because we wish to ignore irrelevant lexicalizations such as NP(board) in Fig. 7.</Paragraph>
<Paragraph position="9"> A lexicalized sub-tree rooted at node n is split into two parts. One is the pattern tree of n, ptn(n). The other is the vector of lexical information of n, lex(n), which contains the lexical items of the root node and the children of the root.</Paragraph>
<Paragraph position="10"> For two tree nodes n1 and n2, the recursive function LT(n1, n2) used to compute the lexicalized tree kernel is defined as follows:</Paragraph>
<Paragraph position="11"> LT(n1, n2) = T'(n1, n2) (1 + Cnt(lex(n1), lex(n2)))    (3)</Paragraph>
<Paragraph position="12"> where T' is the same as the original recursive function T defined in (2), except that T is defined on parse tree nodes, while T' is defined on patterns of parse tree nodes. Cnt(x, y) counts the number of common elements in the vectors x and y. For example, Cnt((join, join, as), (join, join, in)) = 2, since 2 elements of the two vectors are the same.</Paragraph>
<Paragraph position="13"> It can be shown that the lexicalized tree kernel counts the number of common sub-trees that meet the following constraints: none or one node in the sub-tree is lexicalized, and the lexicalized node is the root node or a child of the root, if applicable.</Paragraph>
<Paragraph position="14"> Therefore our new tree kernel is more lexicalized. On the other hand, it immediately follows that the lexicalized tree kernel is well-defined. This means that we can embed the lexicalized tree kernel into a high dimensional space. The proof is similar to the proof for the tree kernel in (Collins and Duffy, 2001).</Paragraph>
<Paragraph position="15"> Another important advantage of the lexicalized tree kernel is that it is more compressible. It is noted in (Collins and Duffy, 2001) that training trees can be combined by sharing sub-trees to speed up testing. As far as the lexicalized tree kernel is concerned, the pattern trees are more compressible because there is no lexical item at the leaf nodes of pattern trees. Lexical information can be attached to the nodes of the resulting pattern forest. In our experiment, we select five parses from each sentence in Collins' training data and represent these parses with shared structure. The number of nodes in the pattern forest is only 1/7 of the total number of nodes in the selected parse trees.</Paragraph> </Section>
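The sketch below makes the decomposition concrete under an assumed (label, head_word, children) node encoding of our own; ptn strips the lexical items, lex collects the lexical items of the node and its immediate children, and Cnt is taken here as positional agreement between the two vectors.

```python
# ptn(n), lex(n) and Cnt for the lexicalized tree kernel (illustrative sketch).

def ptn(node):
    """The pattern tree: the same shape with all lexical items stripped out."""
    if isinstance(node, str):
        return "*"                                  # leaf word removed from the pattern
    label, _, children = node
    return (label, [ptn(c) for c in children])

def lex(node):
    """Lexical items of the node and of its immediate children."""
    _, head, children = node
    return [head] + [c[1] for c in children if not isinstance(c, str)]

def Cnt(x, y):
    """Number of positions at which the two lexical vectors agree."""
    return sum(1 for a, b in zip(x, y) if a == b)

# A sub-tree in the spirit of Fig. 7 (structure assumed for illustration):
vp = ("VP", "join",
      [("VP", "join", [("VB", "join", ["join"]),
                       ("NP", "board", [("NN", "board", ["board"])])]),
       ("PP", "as", [("IN", "as", ["as"])])])

print(lex(vp))                                  # ['join', 'join', 'as'], as in the text
print(Cnt(lex(vp), ["join", "join", "in"]))     # 2
```

Note that NP(board) sits below the root's immediate children and therefore never enters lex(vp), which is exactly the kind of irrelevant lexicalization the decomposition is designed to ignore.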
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Tree Kernel for Derivation Trees </SectionTitle>
<Paragraph position="0"> In order to apply the (lexicalized) tree kernel to derivation trees, we need to make some modifications to the original recursive definition of the tree kernel.</Paragraph>
<Paragraph position="1"> For derivation trees, the recursive function is triggered if the two root nodes have the same non-lexicalized elementary tree (sometimes called a supertag). Note that these two nodes will have the same number of children which are initial trees (auxiliary trees are not counted). In comparison, the recursive function T(n1, n2) in (2) is computed if and only if n1 and n2 have the same bracketing tag and the same number of children.</Paragraph>
<Paragraph position="2"> For each node, its children are attached with one of two distinct operations, substitution or adjunction. For substituted children, the computation of the tree kernel is almost the same as that for a CFG parse tree. However, there is a problem with the adjoined children. Let us first have a look at a sentence in the Penn Treebank.</Paragraph>
<Paragraph position="3"> Example 2: COMMERCIAL PAPER placed directly by General Motors Acceptance Corp.: 8.55% 30 to 44 days; 8.25% 45 to 59 days; 8.45% 60 to 89 days; 8% 90 to 119 days; 7.90% 120 to 149 days; 7.80% 150 to 179 days; 7.55% 180 to 270 days.</Paragraph>
<Paragraph position="4"> In this example, seven sub-trees of the same type are sister adjoined to the same place of an initial tree. So the number of common sub-trees increases dramatically if the tree kernel is applied to two similar parses of this sentence. Experimental evidence indicates that this is harmful to accuracy. Therefore, for derivation trees, we are only interested in sub-trees that contain at most 2 adjunction branches for each node. The number of constrained common sub-trees for the derivation tree kernel can be computed by the recursive function DT over derivation tree nodes n1, n2:</Paragraph>
<Paragraph position="6"> where sub(nk) is the sub-tree of nk in which children adjoined to the root of nk are pruned. T'' is similar to the original recursive function T defined in (2), but it is defined on derivation tree nodes recursively. A1 and A2 are used to count the number of common sub-trees whose root nodes contain only one or two adjunction children respectively.</Paragraph>
<Paragraph position="8"> where a1i is the ith adjunct of n1, and a2j is the jth adjunct of n2. Similarly, we have:</Paragraph>
<Paragraph position="10"> The tree kernel for derivation trees is a well-defined kernel function because we can easily define an embedding space according to the definition of the new tree kernel. By substituting DT for T' in (3), we obtain the lexicalized tree kernel for LTAG derivation trees (using LT in (1)).</Paragraph> </Section> </Section> </Paper>