<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1038"> <Title>Lexicalization in Crosslinguistic Probabilistic Parsing: The Case of French</Title> <Section position="4" start_page="307" end_page="307" type="metho"> <SectionTitle> 3 Tree Transformations </SectionTitle> <Paragraph position="0"> We created a number of different datasets from the FTB, applying various tree transformations to deal with the peculiarities of the FTB annotation scheme.</Paragraph> <Paragraph position="1"> As a first step, the XML-formatted FTB data was converted to PTB-style bracketed expressions. Only the POS tag was kept; the rest of the morphological information for each terminal was discarded. For example, the NP in Figure 1 was transformed to:</Paragraph> </Section> <Section position="5" start_page="307" end_page="307" type="metho"> <SectionTitle> (3) (NP (PRO eux)) </SectionTitle> <Paragraph position="0"> In order to make our results comparable to results from the literature, we also transformed the annotation of punctuation. In the FTB, all punctuation is tagged uniformly as PONCT. We re-assigned the POS tags for punctuation using the PTB tagset, which differentiates between commas, periods, brackets, etc.</Paragraph> <Paragraph position="1"> Compounds have internal structure in the FTB (see Section 2.1). We created two separate datasets by applying two alternative tree transformations that make FTB compounds more similar to compounds in other annotation schemes. The first transformation collapses the compound, concatenating the compound parts with an underscore and taking over the cat information supplied at the compound level. For example, the compound in Figure 2 results in: (4) (P d'_entre) This approach is similar to the treatment of compounds in the German Negra treebank (used by Dubey and Keller 2003), where compounds are not given any internal structure (compounds are mostly spelled without spaces or apostrophes in German).</Paragraph> <Paragraph position="2"> The second approach is expanding the compound.</Paragraph> <Paragraph position="3"> Here, the compound parts are treated as individual words with their own POS (from the catint tag), and the suffix Cmp is appended to the POS of the compound, effectively expanding the tagset.2 Figure 2 now yields: (5) (PCmp (P d') (P entre)).</Paragraph> <Paragraph position="4"> This approach is similar to the treatment of compounds in the PTB (except that the PTB does not use a separate tag for the mother category). We found that in the FTB the POS tag of a compound part is sometimes missing (i.e., the value of catint is blank). In such cases, the missing catint was substituted with the cat tag of the compound. This heuristic produces the correct POS for the subparts of the compound most of the time.</Paragraph> <Paragraph position="5"> 2An alternative would be to retain the cat tag of the compound. The effect of this decision needs to be investigated in future work.</Paragraph> <Paragraph position="6"> Figure 3: Coordination in the FTB (left); after transformation (middle); coordination in the PTB (right)</Paragraph> <Paragraph position="7"> As mentioned previously, coordinate structures have their own constituent label COORD in the FTB annotation. Existing parsing models (e.g., the Collins models) have coordination-specific rules, presupposing that coordination is marked up in PTB format. We therefore created additional datasets to which a transformation is applied that raises coordination, as illustrated in Figure 3. Note that in the FTB annotation scheme a coordinating conjunction is always followed by a syntactic category. Hence the resulting tree, though flatter, is still not fully compatible with the PTB treatment of coordination. A sketch of the two compound transformations is given below.</Paragraph>
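<Paragraph position="8"> To make the two compound treatments concrete, the following Python fragment sketches both transformations. The tuple-based tree representation and the function names are illustrative assumptions; the actual preprocessing operates on the FTB XML.

def collapse_compound(cat, parts):
    # Contract the compound: concatenate the word parts with
    # underscores and keep only the cat tag of the whole compound.
    word = "_".join(word for _, word in parts)
    return (cat, word)

def expand_compound(cat, parts):
    # Expand the compound: each part keeps its own POS (catint),
    # under a mother category formed by appending Cmp to cat.
    # Heuristic from the text: a missing catint falls back to cat.
    children = [(catint if catint else cat, word) for catint, word in parts]
    return (cat + "Cmp", children)

# The compound of Figure 2, with its two parts tagged P:
parts = [("P", "d'"), ("P", "entre")]
print(collapse_compound("P", parts))  # ('P', "d'_entre")
print(expand_compound("P", parts))    # ('PCmp', [('P', "d'"), ('P', 'entre')])
</Paragraph>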
</Section> <Section position="6" start_page="307" end_page="308" type="metho"> <SectionTitle> 4 Probabilistic Parsing Models </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="307" end_page="307" type="sub_section"> <SectionTitle> 4.1 Probabilistic Context-Free Grammars </SectionTitle> <Paragraph position="0"> The aim of this paper is to further explore the crosslinguistic role of lexicalization by applying lexicalized parsing models to the French Treebank. Following Dubey and Keller (2003), we use a standard unlexicalized PCFG as our baseline. In such a model, each context-free rule $LHS \rightarrow RHS$ is annotated with an expansion probability $P(RHS \mid LHS)$. The probabilities of all rules with the same left-hand side have to sum to one, and the probability of a parse tree $T$ is defined as the product of the probabilities of the rules applied in the generation of $T$.</Paragraph> </Section> <Section position="2" start_page="307" end_page="308" type="sub_section"> <SectionTitle> 4.2 Collins' Head-Lexicalized Models </SectionTitle> <Paragraph position="0"> A number of lexicalized models can then be applied to the FTB, comparing their performance to the unlexicalized baseline. We start with Collins' Model 1, which lexicalizes a PCFG by associating a word $w$ and a POS tag $t$ with each non-terminal $X$ in the tree. Thus, a non-terminal is written as $X(x)$, where $x = \langle w, t \rangle$ and $X$ is a constituent label. Each rule now has the form:</Paragraph> <Paragraph position="1"> (1) $P(h) \rightarrow L_n(l_n) \ldots L_1(l_1)\, H(h)\, R_1(r_1) \ldots R_m(r_m)$ </Paragraph> <Paragraph position="2"> Here, $H$ is the head-daughter of the phrase, which inherits the head-word $h$ from its parent $P$. $L_1 \ldots L_n$ and $R_1 \ldots R_m$ are the left and right sisters of $H$. Either $n$ or $m$ may be zero, and $n = m = 0$ for unary rules.</Paragraph> <Paragraph position="3"> The addition of lexical heads leads to an enormous number of potential rules, making direct estimation of $P(RHS \mid LHS)$ infeasible because of sparse data. Therefore, the generation of the RHS of a rule given the LHS is decomposed into three steps: first the head is generated, then the left and right sisters are generated by independent 0th-order Markov processes. The probability of a rule is thus defined as:</Paragraph> <Paragraph position="4"> (2) $P(RHS \mid LHS) = P_h(H \mid P, h) \times \prod_{i=1}^{n+1} P_l(L_i(l_i) \mid P, H, h, d(i)) \times \prod_{i=1}^{m+1} P_r(R_i(r_i) \mid P, H, h, d(i))$ </Paragraph> <Paragraph position="5"> Here, $P_h$ is the probability of generating the head, and $P_l$ and $P_r$ are the probabilities of generating the left and right sisters, respectively. $L_{n+1}(l_{n+1})$ and $R_{m+1}(r_{m+1})$ are defined as stop categories which indicate when to stop generating sisters. $d(i)$ is a distance measure, a function of the length of the surface string between the head and the previously generated sister.</Paragraph> <Paragraph position="6"> Collins' Model 2 further refines the initial model by incorporating the complement/adjunct distinction and subcategorization frames. The generative process is enhanced to include a probabilistic choice of left and right subcategorization frames. The probability of a rule is now:</Paragraph> <Paragraph position="7"> (3) $P(RHS \mid LHS) = P_h(H \mid P, h) \times P_{lc}(LC \mid P, H, h) \times P_{rc}(RC \mid P, H, h) \times \prod_{i=1}^{n+1} P_l(L_i(l_i) \mid P, H, h, d(i), LC) \times \prod_{i=1}^{m+1} P_r(R_i(r_i) \mid P, H, h, d(i), RC)$ </Paragraph> <Paragraph position="8"> Here, $LC$ and $RC$ are the left and right subcat frames, multisets specifying the complements that the head requires among its left or right sisters. The subcat requirements are added to the conditioning context. As complements are generated, they are removed from the appropriate subcat multiset.</Paragraph>
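<Paragraph position="9"> The head-outward decomposition of equation (2) can be rendered schematically as follows. The component distributions are assumed to be given as functions (in the actual models they are smoothed relative-frequency estimates), so this is a sketch of the factorization only, not of Collins' estimation method.

import math

STOP = ("STOP", None)  # pseudo-category terminating each direction

def rule_log_prob(parent, head, h, left, right, P_h, P_l, P_r, dist):
    """Score one lexicalized rule. left/right list the sisters as
    (label, headword) pairs in head-outward order; dist(side, i)
    plays the role of the distance measure d(i)."""
    logp = math.log(P_h(head, parent, h))
    for i, sister in enumerate(left + [STOP]):   # 0th-order Markov, left
        logp += math.log(P_l(sister, parent, head, h, dist("L", i)))
    for i, sister in enumerate(right + [STOP]):  # 0th-order Markov, right
        logp += math.log(P_r(sister, parent, head, h, dist("R", i)))
    return logp
</Paragraph>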
</Section> </Section> <Section position="7" start_page="308" end_page="309" type="metho"> <SectionTitle> 5 Experiment 1: Unlexicalized Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="308" end_page="308" type="sub_section"> <SectionTitle> 5.1 Method </SectionTitle> <Paragraph position="0"> This experiment was designed to compare the performance of the unlexicalized baseline model on four different datasets, created using the tree transformations described in Section 3: compounds expanded (Exp), compounds contracted (Cont), compounds expanded with coordination raised (Exp+CR), and compounds contracted with coordination raised (Cont+CR).</Paragraph> <Paragraph position="1"> We used BitPar (Schmid, 2004) for our unlexicalized experiments. BitPar is a parser based on a bit-vector implementation of the CKY algorithm. A grammar and lexicon were read off our training set, along with rule frequencies and frequencies for lexical items, based on which BitPar computes the rule probabilities using maximum likelihood estimation. A frequency distribution for POS tags was also read off the training set; this distribution is used by BitPar to tag unknown words in the test data.</Paragraph> <Paragraph position="2"> All models were evaluated using the standard Parseval measures of labeled recall (LR), labeled precision (LP), average crossing brackets (CBs), zero crossing brackets (0CB), and two or fewer crossing brackets (≤2CB). We also report tagging accuracy (Tag) and coverage (Cov). A sketch of the bracketing scores is given below.</Paragraph> </Section> <Section position="2" start_page="308" end_page="308" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> Table 1: Results for the unlexicalized models (sentences ≤40 words); each model performed its own POS tagging</Paragraph> <Paragraph position="1"> The results for the unlexicalized model are shown in Table 1 for sentences of length ≤40 words. We find that contracting compounds increases parsing performance substantially compared to expanding compounds, raising labeled recall from around 60% to around 64% and labeled precision from around 59% to around 65%. The results show that raising coordination is also beneficial; it increases precision and recall by 1-2%, both for expanded and for non-expanded compounds.</Paragraph> <Paragraph position="2"> Note that these results were obtained by uniformly applying coordination raising during evaluation, so as to make all models comparable. For the Exp and Cont models, the parsed output and the gold standard files were first converted by raising coordination, and then the evaluation was performed.</Paragraph> </Section> <Section position="3" start_page="308" end_page="309" type="sub_section"> <SectionTitle> 5.3 Discussion </SectionTitle> <Paragraph position="0"> The disappointing performance obtained for the expanded compound models can be partly attributed to the increase in the number of grammar rules (11,704 expanded vs. 10,299 contracted) and POS tags (24 expanded vs. 11 contracted) associated with that transformation.</Paragraph> <Paragraph position="1"> However, a more important observation is that the two compound models do not yield comparable results, since an expanded compound has more brackets than a contracted one. We attempted to address this problem by collapsing the compounds for evaluation purposes (as described in Section 3); for example, (5) would be contracted to (4). However, this approach only works if we can be certain that the model is tagging the right words as compounds. Unfortunately, this is rarely the case. For example, the model outputs: (6) (NCmp (N jours) (N commerçants)) But in the gold standard file, jours and commerçants are two distinct NPs. Collapsing the compounds therefore leads to length mismatches in the test data. This problem occurs frequently in the test set, so that such an evaluation becomes pointless.</Paragraph>
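<Paragraph position="2"> The labeled bracketing scores used throughout this section can be stated compactly. The following is a minimal sketch, assuming trees have already been converted into multisets of labeled spans (label, start, end); it is illustrative only, not the evalb implementation.

from collections import Counter

def parseval(gold_spans, test_spans):
    gold, test = Counter(gold_spans), Counter(test_spans)
    matched = sum((gold & test).values())   # brackets correct in both
    lr = matched / sum(gold.values())       # labeled recall
    lp = matched / sum(test.values())       # labeled precision
    f = 2 * lp * lr / (lp + lr)             # F-score (harmonic mean)
    return lr, lp, f

gold = [("NP", 0, 2), ("VP", 2, 5), ("SENT", 0, 5)]
test = [("NP", 0, 2), ("VP", 3, 5), ("SENT", 0, 5)]
print(parseval(gold, test))  # (0.667, 0.667, 0.667) up to rounding
</Paragraph>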
</Section> </Section> <Section position="8" start_page="309" end_page="311" type="metho"> <SectionTitle> 6 Experiment 2: Lexicalized Models </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="309" end_page="310" type="sub_section"> <SectionTitle> 6.1 Method </SectionTitle> <Paragraph position="0"> Parsing We now compare a series of lexicalized parsing models against the unlexicalized baseline established in the previous experiment. Our aim was to test whether French behaves like English, where lexicalization improves parsing performance, or like German, where lexicalization has only a small effect on parsing performance.</Paragraph> <Paragraph position="1"> The lexicalized parsing experiments were run using Dan Bikel's probabilistic parsing engine (Bikel, 2002), which, in addition to replicating the models described by Collins (1997), also provides a convenient interface for developing corresponding parsing models for other languages.</Paragraph> <Paragraph position="2"> Lexicalization requires that each rule in a grammar have one of the categories on its right-hand side annotated as the head. These head rules were constructed based on the FTB annotation guidelines (provided along with the dataset), as well as by using heuristics, and were optimized on the development set. Collins' Model 2 incorporates a complement/adjunct distinction and probabilities over subcategorization frames. Complements were marked in the training phase based on argument identification rules, tuned on the development set.</Paragraph> <Paragraph position="3"> Part-of-speech tags are generated along with the words in the models; parsing and tagging are fully integrated. To achieve this, Bikel's parser requires a mapping of lexical items to orthographic/morphological word feature vectors. The features implemented (capitalization, hyphenation, inflection, derivation, and compound) were again optimized on the development set.</Paragraph> <Paragraph position="4"> Like BitPar, Bikel's parser implements a probabilistic version of the CKY algorithm. As with standard CKY, even though the model is defined in a top-down, generative manner, decoding proceeds bottom-up. To speed up decoding, the algorithm uses beam search. Collins uses a beam width of 10^4, while we found that a width of 10^5 gave the best trade-off between coverage and parsing speed.</Paragraph> <Paragraph position="5"> Table 2: Average number of daughters of syntactic constituents in three treebanks</Paragraph> <Paragraph position="6"> Flatness As already pointed out in Section 2.1, the FTB uses a flat annotation scheme. This can be quantified by computing the average number of daughters for each syntactic category in the FTB and comparing the figures with those available for the PTB and Negra (Dubey and Keller, 2003). This is done in Table 2; a sketch of the computation follows below. The absence of sentence-internal VPs explains the very high level of flatness for the sentential category SENT (5.84 daughters), compared to the PTB (2.44), and even to Negra, which is also very flat (4.55 daughters). The other sentential categories Ssub (subordinate clauses), Srel (relative clauses), and Sint (interrogative clauses) are also very flat. Note that the FTB uses VP nodes only for non-finite subordinate clauses: VPinf (infinitival clauses) and VPpart (participial clauses); these categories are roughly comparable in flatness to the VP category in the PTB and Negra. For NPs, PPs, APs, and AdvPs, the FTB is roughly as flat as the PTB, and somewhat less flat than Negra.</Paragraph>
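<Paragraph position="7"> A minimal sketch of the flatness statistic, assuming trees are given as nested lists [label, child1, ...] with terminals as plain strings (an illustrative representation, not the treebank format):

from collections import defaultdict

def daughter_counts(tree, stats):
    # Record the number of daughters of every non-terminal.
    label, children = tree[0], tree[1:]
    stats[label].append(len(children))
    for child in children:
        if isinstance(child, list):  # recurse into non-terminals only
            daughter_counts(child, stats)

stats = defaultdict(list)
trees = [["SENT", ["NP", "il"], ["V", "dort"], ["AdvP", "bien"]]]
for tree in trees:
    daughter_counts(tree, stats)
for label, counts in sorted(stats.items()):
    print(label, sum(counts) / len(counts))  # average daughters per label
</Paragraph>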
<Paragraph position="8"> Sister-Head Model To cope with the flatness of the FTB, we implemented three additional parsing models. First, we implemented Dubey and Keller's (2003) sister-head model, which extends Collins' base-NP model to all syntactic categories. This means that the probability function $P_r$ in equation (2) is no longer conditioned on the head but instead on its previous sister, yielding the following definition for $P_r$ (and by analogy $P_l$):</Paragraph> <Paragraph position="9"> (4) $P_r(R_i(r_i) \mid P, R_{i-1}(r_{i-1}), d(i))$ </Paragraph> <Paragraph position="10"> Dubey and Keller (2003) argue that this implicitly adds binary branching to the grammar, and therefore provides a way of dealing with flat annotation (in Negra and in the FTB, see Table 2).</Paragraph> <Paragraph position="11"> Bigram Model This model, inspired by the approach of Collins et al. (1999) for parsing the Prague Dependency Treebank, builds on Collins' Model 2 by implementing a 1st-order Markov assumption for the generation of sister non-terminals. The sisters are now conditioned not only on their head, but also on the previous sister. The probability function for $P_r$ (and by analogy $P_l$) becomes:</Paragraph> <Paragraph position="12"> (5) $P_r(R_i(r_i) \mid P, H, h, d(i), R_{i-1})$ </Paragraph> <Paragraph position="13"> The intuition behind this approach is that the model will learn that the stop symbol is more likely to follow phrases with many sisters. Finally, we also experimented with a third model (BigramFlat) that applies the bigram model only to categories with a high degree of flatness (SENT, Srel, Ssub, Sint, VPinf, and VPpart). The three conditioning contexts are contrasted in the sketch below.</Paragraph>
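<Paragraph position="14"> Schematically, the three model variants differ only in the conditioning context used when generating the i-th right sister. The following fragment shows the back-off keys only; the function names are illustrative, and the probability tables themselves are omitted.

def context_model1(P, H, h, sisters, i, d):
    return (P, H, h, d(i))                      # equation (2)

def context_sister_head(P, H, h, sisters, i, d):
    prev = sisters[i - 1] if i > 0 else None    # (label, headword) pair
    return (P, prev, d(i))                      # equation (4): head dropped

def context_bigram(P, H, h, sisters, i, d):
    prev_label = sisters[i - 1][0] if i > 0 else None
    return (P, H, h, d(i), prev_label)          # equation (5)
</Paragraph>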
</Section> <Section position="2" start_page="310" end_page="311" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle> <Paragraph position="0"> Table 3: Results for the lexicalized models (sentences ≤40 words); each model performed its own POS tagging; all lexicalized models used the Cont+CR data set</Paragraph> <Paragraph position="1"> Constituency Evaluation The lexicalized models were tested on the Cont+CR data set, i.e., compounds were contracted and coordination was raised (this is the configuration that gave the best performance in Experiment 1).</Paragraph> <Paragraph position="2"> Table 3 shows that all lexicalized models achieve a performance of around 80% recall and precision, i.e., they outperform the best unlexicalized model by at least 14% (see Table 1). This is consistent with what has been reported for English on the PTB.</Paragraph> <Paragraph position="3"> Collins' Model 2, which adds the complement/adjunct distinction and subcategorization frames, achieved only a very small improvement over Collins' Model 1, which was not statistically significant according to a χ² test. It might well be that the annotation scheme of the FTB does not lend itself particularly well to the demands of Model 2. Moreover, as Collins (1997) mentions, some of the benefits of Model 2 are already captured by the inclusion of the distance measure.</Paragraph> <Paragraph position="4"> A further small improvement was achieved using Dubey and Keller's (2003) sister-head model; however, again the difference did not reach statistical significance. The bigram model, however, yielded a statistically significant improvement over Collins' Model 1 (recall: χ² = 3.91, df = 1, p < .048; precision: χ² = 3.97, df = 1, p < .046). This is consistent with the findings of Collins et al. (1999) for Czech, where the bigram model improved dependency accuracy by about 0.9%, as well as for English, where Charniak (2000) reports an increase in F-score of approximately 0.3%. The BigramFlat model, which applies the bigram model only to those labels that have a high degree of flatness, performs at roughly the same level as Model 1.</Paragraph> <Paragraph position="5"> Table 4: Results for unlexicalized and lexicalized models (sentences ≤40 words) with correct POS tags supplied; all lexicalized models used the Cont+CR data set</Paragraph> <Paragraph position="6"> The models in Tables 1 and 3 implemented their own POS tagging. Tagging accuracy was 91-93% for BitPar (unlexicalized models) and around 96% for the word-feature enhanced tagging model of the Bikel parser (lexicalized models). POS tags are an important cue for parsing. To obtain an upper bound on the performance of the parsing models, we reran the experiments providing the correct POS tags for the words in the test set. While BitPar always uses the tags provided, the Bikel parser only uses them for words whose frequency is below the unknown-word threshold. As Table 4 shows, perfect tagging increased parsing performance in the unlexicalized models by around 3%. This shows that the poor POS tagging performed by BitPar is one of the reasons for the poor performance of the unlexicalized models. The impact of perfect tagging is less drastic on the lexicalized models (around 1% increase). However, our main finding, viz., that lexicalized models outperform unlexicalized models considerably on the FTB, remains valid even with perfect tagging.3</Paragraph> <Paragraph position="7"> 3It is important to note that the Collins model has a range of other features that set it apart from a standard unlexicalized PCFG (notably Markovization), as discussed in Section 4.2. It is therefore likely that the gain in performance is not attributable to lexicalization alone.</Paragraph> <Paragraph position="8"> Dependency Evaluation We also evaluated our models using dependency measures, which have been argued to be more annotation-neutral than Parseval. Lin (1995) notes that labeled bracketing scores are more susceptible to cascading errors, where one incorrect attachment decision causes the scoring algorithm to count more than one error.</Paragraph> <Paragraph position="9"> The gold standard and parsed trees were converted into dependency trees using the algorithm described by Lin (1995); a sketch is given below. Dependency accuracy is defined as the ratio of correct dependencies to the total number of dependencies in a sentence. (Note that this is an unlabeled dependency measure.) Dependency accuracy and constituency F-score are shown in Table 5 for the most relevant FTB models. (F-score is computed as the harmonic mean of labeled recall and precision, $F = 2 \cdot LR \cdot LP / (LR + LP)$.) Numerically, dependency accuracies are higher than constituency F-scores across the board. However, the effect of lexicalization is the same on both measures: for the FTB, a gain of 11% in dependency accuracy is observed for the lexicalized model.</Paragraph>
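<Paragraph position="10"> The following is a simplified sketch of that dependency evaluation. Lin's (1995) algorithm is reduced to its core idea: heads are percolated up the tree, and every non-head daughter's lexical head depends on the head daughter's lexical head. The head-percolation table and the tree representation are toy assumptions, not the actual FTB head rules.

HEAD_CHILD = {"SENT": 1, "NP": 0, "VP": 0}  # toy head rules: index of head daughter

def lexical_head(tree):
    if isinstance(tree, str):        # terminal: the word itself
        return tree
    label, children = tree[0], tree[1:]
    return lexical_head(children[HEAD_CHILD.get(label, 0)])

def dependencies(tree, deps):
    if isinstance(tree, str):
        return
    head = lexical_head(tree)
    for child in tree[1:]:
        child_head = child if isinstance(child, str) else lexical_head(child)
        if child_head != head:       # non-head daughters depend on the head
            deps.add((child_head, head))
        dependencies(child, deps)

def dependency_accuracy(gold_tree, parsed_tree):
    gold, test = set(), set()
    dependencies(gold_tree, gold)
    dependencies(parsed_tree, test)
    return len(gold & test) / len(gold)  # unlabeled dependency accuracy
</Paragraph>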
</Section> </Section> <Section position="9" start_page="311" end_page="311" type="metho"> <SectionTitle> 7 Experiment 3: Crosslinguistic Comparison </SectionTitle> <Paragraph position="0"> The results reported in Experiments 1 and 2 shed some light on the role of lexicalization in parsing French, but they are not strictly comparable to the results that have been reported for other languages. This is because the treebanks available for different languages typically vary considerably in size: our FTB training set comprised about 8,500 sentences, while the standard training set for the PTB is about 40,000 sentences in size, and the Negra training set used by Dubey and Keller (2003) comprises about 18,600 sentences. This means that the differences in the effect of lexicalization that we observe could simply be due to the size of the training set: lexicalized models are more susceptible to data sparseness than unlexicalized ones.</Paragraph> <Paragraph position="1"> We therefore conducted another experiment in which we applied Collins' Model 2 to subsets of the PTB that were comparable in size to our FTB data sets. We combined sections 02-05 and 08 of the PTB (8,345 sentences in total) to form the training set, and the first 1,000 sentences of section 23 to form our test set. As a baseline model, we also ran an unlexicalized PCFG on the same data sets.</Paragraph> <Paragraph position="2"> For comparison with Negra, we also include the results of Dubey and Keller (2003): they report the performance of Collins' Model 1 on a training set of 9,301 sentences and a test set of 1,000 sentences, which are comparable in size to our FTB data sets. The results of the crosslinguistic comparison are shown in Table 6.</Paragraph> <Paragraph position="3"> Table 6: Crosslinguistic comparison of lexicalized and unlexicalized models (sentences ≤40 words)</Paragraph> <Paragraph position="4"> We conclude that the effect of lexicalization is stable even if the size of the training set is held constant across languages: for the FTB, we find that lexicalization increases F-score by around 13%; for the PTB, we find an effect of lexicalization of about 14%. For the German Negra treebank, however, the performance of the lexicalized and the unlexicalized models is almost indistinguishable. (This is true for Collins' Model 1; note that Dubey and Keller (2003) do report a small improvement for the lexicalized sister-head model.)</Paragraph> </Section> </Paper>