<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0401"> <Title>A model of syntactic disambiguation based on lexicalized grammars</Title> <Section position="3" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Probability Model based on Lexicalized </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Grammars </SectionTitle> <Paragraph position="0"> This section introduces our model of syntactic disambiguation, which is based on the decomposition of the parsing model into the syntax and semantics models. The concept behind it is that the plausibility of a parsing result is determined by i) the plausibility of syntax, and ii) selecting the most probable semantics from the structures allowed by the given syntax. This section formalizes the general form of statistical models for disambiguation of parsing including lexicalized parse trees, derivation trees, and dependency structures. Problems with the existing models are then discussed, and our model is introduced.</Paragraph> <Paragraph position="1"> Suppose that a set W of words and a set C of syntactic categories (e.g., nonterminal symbols of CFG, elementary trees of LTAG, feature structures of HPSG (Sag and Wasow, 1999)) are given. A lexicalized grammar is Lexicalized parse tree then defined as a tuple G = <L,R> , where L = {l = <w,c> |w [?]W,c [?]C}is a lexicon and R is a set of grammar rules. A parsing result of lexicalized grammars is defined as a labeled graph structure A = {a|a =</Paragraph> <Paragraph position="3"> example, the lexicalized parse tree in Figure 2 is represented in this form as in Figure 5, as well as the derivation tree and the dependency structure.</Paragraph> <Paragraph position="4"> Given the above definition, the existing models discussed in Section 2 yield a probability P(A|w) for given sentence w as in the following general form.</Paragraph> <Paragraph position="6"> In short, the probability of the complete structure is defined as the product of probabilities of lexical dependencies. For example, p(a|e) corresponds to the probability of branchings in LPCFG models, that of substitution/adjunction in derivation tree models, and that of primitive dependencies in dependency structure models.</Paragraph> <Paragraph position="7"> The models, however, have a crucial weakness with lexicalized grammar formalism; probability values are assigned to parsing results not allowed by the grammar, i.e., the model is no longer consistent. Hence, the disambiguation model of lexicalized grammars should not be decomposed into primitive lexical dependencies.</Paragraph> <Paragraph position="8"> A possible solution to this problem is to directly estimate p(A|w) by applying a maximum entropy model (Berger et al., 1996). However, such modeling will lead us to extensive tweaking of features that is theoretically unjustifiable, and will not contribute to the theoretical investigation of the relations of syntax and semantics.</Paragraph> <Paragraph position="9"> Since lexicalized grammars express all syntactic constraints by syntactic categories of words, we have assumed that we first determine which syntactic category c should be chosen, and then determine which argument relations are likely to appear under the constraints imposed by the syntactic categories. 
<Paragraph position="10"> The first probability in the above formula is the probability of syntactic categories, i.e., the probability of selecting a sequence of syntactic categories for a sentence.</Paragraph> <Paragraph position="11"> Since syntactic categories in lexicalized grammars determine the syntactic constraints of words, this expresses the syntactic preference of each word in a sentence. Note that our objective is not only to improve parsing accuracy but also to investigate the relation between syntax and semantics. We did not adopt the local contexts of words, as in the supertaggers for LTAG (Joshi and Srinivas, 1994), because they partially include the semantic preferences of a sentence. The probability is a purely unigram model that selects the most probable syntactic category for each word. The probability is thus given by the product of the probabilities of selecting a syntactic category for each word from the set of candidate categories allowed by the lexicon.</Paragraph> <Paragraph position="12"> p(c|w) = ∏_i p(c_i|w_i), where c_i is the syntactic category assigned to the i-th word w_i. </Paragraph> <Paragraph position="13"> The second factor is the probability of semantics, which expresses the semantic preferences for relating the words in a sentence. Note that the semantics probability depends on the syntactic categories determined by the syntax probability, because in the lexicalized grammar formalism a sequence of syntactic categories determines the possible structures of parsing results. Parsing results are obtained by solving the constraints given by the grammar.</Paragraph> <Paragraph position="14"> Hence, we cannot simply decompose the semantics probability into the dependency probabilities of two words. We define the semantics probability as a discriminative model that selects the most probable parsing result from the set of candidates given by parsing.</Paragraph> <Paragraph position="15"> Since the semantics probability cannot be decomposed into independent sub-events, we applied a maximum entropy model, which allows probabilistic modeling without the independence assumption. Using this model, we can assign consistent probabilities to parsing results with complex structures, such as those represented with feature structures (Abney, 1997; Johnson et al., 1999). Given a parsing result A, the semantics probability is defined as p(A|c) = (1/Z_c) exp(∑_{s ∈ S(A)} λ(s)), where Z_c = ∑_{A' ∈ A(c)} exp(∑_{s ∈ S(A')} λ(s)), S(A) is the set of connected subgraphs of A, λ(s) is the weight of subgraph s, and A(c) is the set of parsing results allowed by the sequence of syntactic categories c. Since we aim at separating syntactic and semantic preferences, the feature functions for the semantics probability distinguish only words, not syntactic categories. We should note that subgraphs are not limited to a single edge, i.e., the lexical dependency of two words. By taking more than one edge as a subgraph, we can represent the dependency of more than two words, although existing models do not adopt such dependencies. Various ambiguities should be resolved by considering the dependencies of more than two words; e.g., PP-attachment ambiguity should be resolved by the dependency of three words.</Paragraph> <Paragraph position="16"> Consequently, the probability model takes the following form: p(A|w) = p(c|w) p(A|c) = (∏_i p(c_i|w_i)) (1/Z_c) exp(∑_{s ∈ S(A)} λ(s)).</Paragraph> <Paragraph position="17"> However, this model has a crucial flaw: the maximum likelihood estimation of the semantics probability is intractable. This is because the estimation requires Z_c to be computed, which requires a summation over A(c), an exponentially large set of parsing results. To cope with this problem, we applied an efficient algorithm of maximum entropy estimation for feature forests (Miyao and Tsujii, 2002; Geman and Johnson, 2002). This enables tractable estimation of the above probability when the set of candidates is represented as a feature forest of tractable size.</Paragraph>
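As an illustration of the feature-forest idea, the sketch below computes Z_c by an inside-style dynamic program over a packed forest of conjunctive and disjunctive nodes; the class names, the memoization scheme, and the omission of the outside pass (needed for the gradient computation during estimation) are assumptions of this sketch rather than details reported in the paper.

    import math
    from dataclasses import dataclass, field
    from typing import Dict, List

    # A minimal feature-forest representation (hypothetical class names):
    # conjunctive nodes carry the summed weights of the features assigned to
    # them, and disjunctive nodes pack alternative sub-analyses.

    @dataclass
    class Conjunctive:
        weight: float                               # sum of lambda(s) for this node's features
        daughters: List["Disjunctive"] = field(default_factory=list)

    @dataclass
    class Disjunctive:
        choices: List[Conjunctive] = field(default_factory=list)

    def inside(node, memo: Dict[int, float] = None) -> float:
        """Inside value of a node; memoized so shared sub-forests are visited once."""
        if memo is None:
            memo = {}
        key = id(node)
        if key in memo:
            return memo[key]
        if isinstance(node, Conjunctive):
            value = math.exp(node.weight)
            for d in node.daughters:
                value *= inside(d, memo)
        else:  # Disjunctive: sum over the packed alternatives
            value = sum(inside(c, memo) for c in node.choices)
        memo[key] = value
        return value

    # Z_c is the inside value of the forest's root disjunctive node, computed in
    # time proportional to the forest size rather than to the number of
    # unpacked parsing results.
    def partition_function(root: Disjunctive) -> float:
        return inside(root)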
<Paragraph position="18"> Here, we should mention that this model completely overcomes the disadvantages of the traditional models discussed in Section 2. It can be applied to any parsing result given by a lexicalized grammar, does not require the independence assumption, and is defined as a combination of syntax and semantics probabilities, where the semantics probability is a discriminative model that selects a parsing result from the set of candidates given by the syntax probability.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="3" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The model proposed in Section 3 is generally applicable to any lexicalized grammar, and this section reports the evaluation of our model with a wide-coverage LTAG grammar automatically acquired from Sections 02-21 of the Penn Treebank (Marcus et al., 1994). The grammar was acquired by an algorithm similar to that of (Xia, 1999) and consisted of 2,105 elementary trees, of which 1,010 were initial trees and 1,095 were auxiliary trees.</Paragraph> <Paragraph position="1"> The coverage of the grammar on Section 22 (1,700 sentences) was 92.6% (1,575 sentences) in a weak sense (i.e., the grammar could output a structure consistent with the bracketing in the test corpus), and 68.0% (1,156 sentences) in a strong sense (i.e., the grammar could output exactly the correct derivation).</Paragraph> <Paragraph position="2"> Since the grammar acquisition algorithm could output derivation trees for the sentences in the training corpus (Sections 02-21), we used them as the training set for the probability model. The syntax probability model was estimated from the syntactic categories appearing in the training set. For estimating the semantics probability, a parser produced all possible derivation trees for each sequence of syntactic categories (corresponding to each sentence) in the training set, and the obtained derivation trees, i.e., A(c), were passed to a maximum entropy estimator. By applying the grammar acquisition algorithm to Section 22, we obtained the derivation trees of the sentences in this section, and from this set we prepared a test set by eliminating non-sentential entries, long sentences (containing more than 40 words), sentences not covered by the grammar, and sentences that caused time-outs in parsing. The resulting set consisted of 917 derivation trees.</Paragraph> <Paragraph position="3"> The following three disambiguation models were prepared using the training set.</Paragraph> <Paragraph position="4"> syntax: composed only of the syntax probability, i.e., p(c|w).
traditional: similar to our model, but the semantics probability p(A|c) was decomposed into the probabilities of the primitive dependencies of two words, as in traditional modeling; i.e., this is an inconsistent probability model.
our model: the model estimated by maximum entropy estimation for feature forests.
The syntax probability was a unigram model; contexts around the word, such as previous words/categories, were not used. Hence, it included only the syntactic preferences of words.
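As a concrete illustration of the unigram syntax model, the following sketch estimates p(c|w) by relative frequency over the (word, elementary tree) pairs read off the training derivations and scores a category sequence as a product over words; the function names and the absence of smoothing are assumptions of this sketch, not details reported in the paper.

    from collections import Counter
    from typing import Dict, List, Tuple

    def estimate_unigram(training: List[List[Tuple[str, str]]]) -> Dict[Tuple[str, str], float]:
        """Relative-frequency estimate of p(c|w) from (word, category) pairs
        read off the training derivation trees."""
        pair_counts = Counter()
        word_counts = Counter()
        for sentence in training:
            for word, cat in sentence:
                pair_counts[(word, cat)] += 1
                word_counts[word] += 1
        return {(w, c): n / word_counts[w] for (w, c), n in pair_counts.items()}

    def syntax_prob(words: List[str], cats: List[str],
                    p: Dict[Tuple[str, str], float]) -> float:
        """p(c|w) = product over words of p(c_i|w_i)."""
        prob = 1.0
        for w, c in zip(words, cats):
            prob *= p.get((w, c), 0.0)  # unseen pairs get zero; smoothing omitted
        return prob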
The semantics parts of traditional and our model were maximum entropy models in which exactly the same set of features was used; i.e., the difference between the two models was only in the event representation: derivation trees were decomposed into primitive dependencies in traditional, while in our model they were represented by a feature forest without decomposition. Hence, we can evaluate the effect of applying maximum entropy estimation for feature forests by comparing our model with traditional. While our model allows features that are not limited to the dependencies of two words (Section 3), the models used throughout the experiments included only features of the dependencies of two words. The semantics probabilities were developed with two sets of features including the surface forms/POSs of words, the labels of dependencies (substitution/adjunction), and the distance between the two words. The first feature set had 283,755 features, and the other had 150,156 features, excluding the fine-grained features of the first set. There were 701,819 events for traditional and 32,371 for our model. The difference in the number of events was caused by the difference in the units of events: an event corresponded to a dependency in traditional, while it corresponded to a sentence in our model.</Paragraph> <Paragraph position="5"> The parameters of the models were estimated by the limited-memory BFGS algorithm (Nocedal, 1980) with a Gaussian distribution as the prior probability distribution for smoothing (Chen and Rosenfeld, 1999), implemented in a maximum entropy estimator for feature forests (Miyao, 2002). The estimation for traditional converged in 67 iterations in 127 seconds, and that for our model in 29 iterations in 111 seconds, on a Pentium III 1.26-GHz CPU with 4 GB of memory. These results show that estimation with our model is comparable in efficiency to traditional. The parsing algorithm was CKY-style parsing with beam thresholding, similar to those used in (Collins, 1996; Clark et al., 2002). Although we needed to compute the normalizing factor Z_c to obtain probability values, we used unnormalized products as the preference score for beam thresholding, following (Clark et al., 2002). We did not use any preprocessing such as supertagging (Joshi and Srinivas, 1994); the parser searched for the most plausible derivation tree in the derivation forest in terms of the probability given by the combination of the syntax and semantics probabilities.</Paragraph> <Paragraph position="6"> Tables 1 and 2 list the accuracy of dependencies, i.e., edges in derivation trees, for each model with the two sets of features for the semantics model. (Since the features of the syntax part were not changed, the results for syntax are exactly the same.)</Paragraph> <Paragraph position="7"> Since in derivation trees each word in a sentence depends on one and only one word (see Figure 3), the accuracy is the number of correct edges divided by the number of all edges in the tree. The exact column indicates the ratio of dependencies for which the syntactic category, the argument position, and the head word on which the argument word depends are all output correctly. The partial column shows the ratio of dependencies in which the words are related correctly, regardless of the label.
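The two accuracy measures can be stated operationally with the following sketch; the dependency record fields and helper names are illustrative assumptions about how a derivation-tree edge might be represented, not the paper's own data format.

    from typing import List, NamedTuple

    class Dep(NamedTuple):
        # One derivation-tree edge per word: the argument word, the head word it
        # attaches to, the elementary tree (syntactic category) of the argument,
        # and the argument position. The field choice is illustrative.
        arg: int
        head: int
        category: str
        arg_position: str

    def dependency_accuracy(gold: List[Dep], predicted: List[Dep]):
        """Exact: category, argument position, and head word all correct.
        Partial: the argument attaches to the correct head word, label ignored."""
        pred_by_arg = {d.arg: d for d in predicted}
        exact = partial = 0
        for g in gold:
            p = pred_by_arg.get(g.arg)
            if p is None:
                continue
            if p.head == g.head:
                partial += 1
                if p.category == g.category and p.arg_position == g.arg_position:
                    exact += 1
        total = len(gold)  # each word has exactly one head in a derivation tree
        return exact / total, partial / total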
We should note that the exact measure is very stringent, because the model must select the correct syntactic category from 2,105 categories.</Paragraph> <Paragraph position="8"> First, we can see that syntax achieved a high level of accuracy, although not yet a sufficient one. We think this was because the grammar could adequately restrict the possible structures of parsing results, and the disambiguation model only had to search for the most probable structure among the candidates allowed by the grammar.</Paragraph> <Paragraph position="9"> Second, traditional and our model recorded significantly higher accuracy than syntax. The accuracy of our model almost matched that of traditional, which demonstrates the validity of probabilistic modeling with maximum entropy estimation for feature forests. The differences between traditional and our model were insignificant; the results show that a consistent probability model of parsing can be built without the independence assumption and that it attains performance rivaling the traditional models in terms of parsing accuracy.</Paragraph> <Paragraph position="10"> We should note that accuracy can be further improved with our model, because it allows the incorporation of features that were not used in these experiments, since the model does not rely on the decomposition into the dependencies of two words. Another possibility for increasing the accuracy is to refine the LTAG grammar. Although we assumed that all syntactic constraints were expressed by syntactic categories (Section 3), i.e., elementary trees, the grammar used in the experiments was not augmented with feature structures and was not sufficiently restrictive to eliminate syntactically invalid structures. Since our model did not include preferences over the syntactic relations of words, we expect that refining the grammar will greatly improve the accuracy.</Paragraph> </Section> </Paper>