<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1010"> <Title>Probabilistic CFG with latent annotations</Title> <Section position="5" start_page="78" end_page="84" type="concl"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We conducted four sets of experiments. In the first set, we examined the degree to which trained models depend on initialization, because EM-style algorithms yield different results for different initial parameter values. In the second set, we examined the relationship between model types and parsing performance. In the third set, we compared the three parsing methods described in the previous section. Finally, we report the result of a parsing experiment on the standard test set.</Paragraph> <Paragraph position="1"> We used sections 2 through 20 of the Penn WSJ corpus as training data and section 21 as heldout data. The heldout data was used for early stopping; i.e., estimation was stopped when the rate of increase in the likelihood of the heldout data fell below a certain threshold. Section 22 was used as test data in all parsing experiments except the final one, in which section 23 was used. We stripped off all function tags and eliminated empty nodes in the training and heldout data, but did no other pre-processing, such as comma raising or base-NP marking (Collins, 1999), apart from binarization.</Paragraph>
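<Paragraph> The early-stopping criterion above is straightforward to state in code. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical model object with em_step and log_likelihood methods, and the threshold value is illustrative. </Paragraph>
```python
def train_with_early_stopping(model, train_trees, heldout_trees, threshold=1e-4):
    """Run EM until the per-sentence heldout log-likelihood gain
    falls below `threshold` (the early-stopping rule described above)."""
    prev_ll = model.log_likelihood(heldout_trees) / len(heldout_trees)
    while True:
        model.em_step(train_trees)    # one E-step + M-step over the training trees
        ll = model.log_likelihood(heldout_trees) / len(heldout_trees)
        if ll - prev_ll < threshold:  # rate of increase too small: stop
            break
        prev_ll = ll
    return model
```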
<Section position="1" start_page="79" end_page="79" type="sub_section"> <SectionTitle> 4.1 Dependency on initial values </SectionTitle> <Paragraph position="0"> To see the degree to which trained models depend on initialization, four instances of the same model were trained with different initial parameter values. The model used in this experiment was created by CENTER-PARENT binarization, and |H| was set to 16. Table 1 lists the training/heldout-data log-likelihood per sentence (LL) for the four instances and their parsing performances on the test set (section 22). The parsing performances were obtained using the approximate distribution method of Section 3.2. As Table 1 shows, different initial values affect the results of training to some extent.</Paragraph> </Section> <Section position="2" start_page="79" end_page="84" type="sub_section"> <SectionTitle> 4.2 Model types and parsing performance </SectionTitle> <Paragraph position="0"> We compared four types of binarization. The original form is depicted in Figure 5 and the binarized forms are shown in Figure 6. In the first two methods, called CENTER-PARENT and CENTER-HEAD, the head-finding rules of Collins (1999) were used. We obtained an observable grammar G for each model by reading off grammar rules from the binarized training trees. For each binarization method, PCFG-LA models with different numbers of latent annotation symbols, |H| = 1, 2, 4, 8, and 16, were trained. The relationships between the number of parameters in the models and their parsing performances are shown in Figure 7. Note that models created using different binarization methods have different numbers of parameters for the same |H|. The parsing performances were measured using F1 scores of the parse trees obtained by re-ranking the 1000-best parses produced by a PCFG.</Paragraph> <Paragraph position="1"> We can see that parsing performance improves as the model size increases. We can also see that models of roughly the same size yield similar performances regardless of the binarization scheme used, except for the models created using LEFT binarization with small numbers of parameters (|H| = 1 and 2). Taking into account the dependency on initial values at the level shown in the previous experiment, we cannot say that any single model is superior to the others when the models are large enough.</Paragraph> <Paragraph position="2"> The results shown in Figure 7 suggest that we could further improve parsing performance by increasing the model size. However, both the memory footprint and the training time grow more than linearly in |H|; training the largest (|H| = 16) models took about 15 hours for the models created using CENTER-PARENT, CENTER-HEAD, and LEFT, and about 20 hours for the model created using RIGHT. To deal with larger models (e.g., |H| = 32 or 64), we therefore need a model search that reduces the number of parameters while maintaining performance, and an approximation during training that reduces the training time.</Paragraph> </Section> <Section position="3" start_page="84" end_page="84" type="sub_section"> <SectionTitle> 4.3 Comparison of parsing methods </SectionTitle> <Paragraph position="0"> The relationships between average parse time and parsing performance for the three parsing methods described in Section 3 are shown in Figure 8. A model created using CENTER-PARENT with |H| = 16 was used throughout this experiment. The data points were obtained by varying configurable parameters of each method, which control the number of candidate parses. To create the candidate parses, we first parsed input sentences using a PCFG (the same as the one that Klein and Manning (2003) call a 'markovised PCFG with vertical order = 2 and horizontal order = 1', extracted from sections 02-20; by itself it gave 79.6/78.5 LP/LR on the development set, and it was also used in the experiment in Section 4.4), with beam thresholding at beam width β. The data points on a line in the figure were created by varying β with the other parameters fixed. The first method re-ranked the N-best parses enumerated from the chart after the PCFG parsing; the two lines for the first method in the figure correspond to N = 100 and N = 300. In the second and third methods, we removed all dominance relations among chart items that did not contribute to any parse whose PCFG score was higher than θ·s_max, where s_max is the PCFG score of the best parse in the chart. The parses remaining in the chart were the candidate parses for the second and third methods; the different lines for these methods correspond to different values of θ.</Paragraph>
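<Paragraph> The pruning step used by the second and third methods can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' code: the packed chart is assumed to be a dict from items to lists of (rule log-probability, children) derivations with keys in topological order, and the helper names (viterbi_inside, viterbi_outside, prune) are hypothetical. An item survives exactly when the best parse passing through it scores at least θ·s_max, computed from Viterbi inside and outside scores. </Paragraph>
```python
import math
from collections import defaultdict

def viterbi_inside(chart, item, cache):
    """Best log-score of any subtree rooted at `item` (0.0 for lexical items)."""
    if item in cache:
        return cache[item]
    if not chart[item]:                    # lexical item: log-prob 0 = log 1
        cache[item] = 0.0
        return 0.0
    best = max(rule_logp + sum(viterbi_inside(chart, c, cache) for c in children)
               for rule_logp, children in chart[item])
    cache[item] = best
    return best

def viterbi_outside(chart, root, inside):
    """Best log-score of any parse context above each item (top-down pass)."""
    outside = defaultdict(lambda: -math.inf)
    outside[root] = 0.0
    for item in chart:                     # assumes keys are topologically ordered
        for rule_logp, children in chart[item]:
            for i, c in enumerate(children):
                rest = sum(inside[s] for j, s in enumerate(children) if j != i)
                outside[c] = max(outside[c], outside[item] + rule_logp + rest)
    return outside

def prune(chart, root, theta):
    """Keep only items lying on some parse whose score exceeds theta * s_max."""
    inside = {}
    s_max = viterbi_inside(chart, root, inside)   # score of the best parse
    outside = viterbi_outside(chart, root, inside)
    cutoff = math.log(theta) + s_max              # theta * s_max in log space
    return {item: ways for item, ways in chart.items()
            if inside.get(item, -math.inf) + outside[item] >= cutoff}
```
<Paragraph> Under these assumptions, the surviving chart encodes exactly the candidate parse forest that the second and third methods then score with the PCFG-LA model. </Paragraph>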
<Paragraph position="1"> As the figure shows, the third method outperforms the other two unless the parse time is very limited (i.e., less than about 1 sec is required). The superiority of the third method over the first appears to stem from the difference in the number of candidate parses from which the outputs are selected: the number of parses contained in the packed forest exceeds 1 million for over half of the test sentences, while the number of parses for which the first method can compute the exact probability in a comparable time (around 4 sec) is only about 300. The superiority of the third method over the second is a natural consequence of the consistent use of P(T|w) both in estimation (as the objective function) and in parsing (as the score of a parse).</Paragraph> </Section> <Section position="4" start_page="84" end_page="84" type="sub_section"> <SectionTitle> 4.4 Comparison with related work </SectionTitle> <Paragraph position="0"> Parsing performance on section 23 of the WSJ corpus using a PCFG-LA model is shown in Table 2. Of the four instances compared in the second experiment, we used the one that gave the best results on the development set. Several previously reported results on the same test set are also listed in Table 2.</Paragraph> <Paragraph position="1"> Our result is below those of state-of-the-art lexicalized PCFG parsers (Collins, 1999; Charniak, 1999), but comparable to that of the unlexicalized PCFG parser of Klein and Manning (2003). Klein and Manning's PCFG is annotated with many linguistically motivated features that they found through extensive manual feature selection. In contrast, our method induces all parameters automatically, except that manually written head rules are used in binarization. Thus, our method can extract a considerable amount of hidden regularity from parsed corpora. However, our result is worse than those of the lexicalized parsers despite the fact that our model has access to the words in the sentences. This suggests that certain types of information exploited by those lexicalized parsers are hard to learn with our approach.</Paragraph> </Section> </Section> </Paper>