<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0734"> <Title>Chunking with WPDV Models</Title> <Section position="4" start_page="154" end_page="155" type="ackno"> <SectionTitle> 3 Results </SectionTitle> <Paragraph position="0"> The Fβ=1 scores for all systems are listed in Table 1. They vary greatly per phrase type, partly because of the relative difficulty of the tasks but also because of the variation in the number of relevant training and test cases: the most frequent phrase types (NP, PP and VP) also show the best results. Note that three of the phrase types (CONJP, INTJ and LST) are too infrequent to yield statistically sensible information. The TiMBL results are worse than the ones reported by Buchholz et al. (1999), but the latter were based on training on WSJ sections 00-19 and testing on sections 20-24. When comparing with the NP scores of Daelemans et al. (1999), we see a comparable accuracy (actually slightly higher, because of the second-level classification).</Paragraph> <Paragraph position="1"> The WPDV accuracies are almost all much higher. For NP, the basic and reverse models produce accuracies that can compete with the highest non-combination accuracies published so far. Interestingly, the reverse model yields the best overall score. This can be explained by the observation that many choices, e.g. PP/PRT and especially ADJP/part of NP, are based mostly on the right context, about which more information becomes available when the text is handled from right to left. The R&M-type IOB-tags are generally less useful than the standard ones, but still show exceptional quality for some phrase types, e.g. PRT. The results for the LOB model are disappointing, given the overall quality of the tagger used.</Paragraph> <Paragraph position="3"> The tagger reaches a test data precision of 97.82% on the held-out 10% of LOB.
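As a note on the evaluation measure: the Fβ=1 score used throughout is the weighted combination of precision and recall with β set to 1, i.e. their harmonic mean. A minimal sketch (the function name and the example values are illustrative, not taken from the paper's tables):

```python
def f_beta(precision, recall, beta=1.0):
    """F measure combining precision and recall.

    With beta = 1 (the F(beta=1) reported in the paper), this
    reduces to the harmonic mean: 2 * P * R / (P + R).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only, not figures from the paper:
print(round(f_beta(0.90, 0.80), 4))  # prints 0.8471
```

Because β = 1 weights precision and recall equally, a gain in one that is offset by a comparable loss in the other leaves the score nearly unchanged.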
I hypothesize this to be due to: a) differences in text type between LOB and WSJ, b) partial incompatibility between the LOB tags and the WSJ chunks, and c) insufficiency of chunker training set size for the more varied LOB tags.</Paragraph> <Paragraph position="4"> Combination, as in other tasks (e.g. van Halteren et al. (To appear)), leads to an impressive accuracy increase, especially for the three most frequent phrase types, where there is a sufficient number of cases to train the combination model on. There are only two phrase types, ADVP and SBAR, where a base chunker (reverse WPDV) manages to outperform the combination. In both cases the four normal-direction base chunkers outvote the better-informed reverse chunker, probably because the combination system has insufficient training material to recognize the higher information value of the reverse model for these two phrase types. Even though the results are already quite good, I expect that even more effective combination is possible, with an increase in training set size and the inclusion of more base chunkers, especially ones which differ substantially from the current, still rather homogeneous, set.</Paragraph> <Paragraph position="5"> The corrective measures yield a further, although less impressive, improvement. Unsurprisingly, the increase is found mostly for the NP. The next most affected phrase type is the ADJP, which can often be joined with or removed from the NP. There is an increase in recall for ADJP (71.23% to 71.46%), but a decrease in precision (78.20% to 77.86%), leaving the Fβ=1 value practically unchanged. For ADVP, there is a loss of accuracy, most likely caused by the one-shot correction procedure.</Paragraph> <Paragraph position="6"> This loss will probably disappear when a procedure is used which is iterative and also targets phrase types other than the NP.
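The combination step described earlier, in which the four normal-direction base chunkers can outvote the better-informed reverse chunker, can be illustrated as a per-token vote over IOB tags. This is a simplified sketch: the paper's actual combination model is trained rather than a plain majority vote, and the function name and tag values here are illustrative.

```python
from collections import Counter

def combine_by_vote(predictions):
    """Combine per-token IOB tags from several base chunkers.

    predictions: a list of tag sequences, one per base chunker,
    all of equal length. Each token receives the tag with the
    most votes; ties go to the tag encountered first.
    """
    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        combined.append(counts.most_common(1)[0][0])
    return combined

# Four 'normal direction' chunkers outvoting one reverse chunker
# on the third token (illustrative tags, not data from the paper):
votes = [
    ["B-NP", "I-NP", "O"],
    ["B-NP", "I-NP", "O"],
    ["B-NP", "I-NP", "B-PP"],
    ["B-NP", "I-NP", "O"],
    ["B-NP", "B-NP", "O"],
]
print(combine_by_vote(votes))  # ['B-NP', 'I-NP', 'O']
```

Under such a scheme a dissenting chunker is always overruled, however well-informed; a trained combination model can in principle learn to weight one chunker's vote more heavily, but only given enough training material, which matches the explanation offered above for ADVP and SBAR.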
For VP, on the other hand, there is an accuracy increase, probably due to corrected decisions on including participles in, or excluding them from, NPs. The overall scores show an increase, due especially to the per-type increases for the very frequent NP and VP.</Paragraph> <Paragraph position="7"> All scores for the chunking system as a whole, including precision and recall percentages, are listed in Table 2. For all phrase types, the system yields substantially better results than any previously published. I attribute the improvements primarily to the combination architecture, with a smaller but still valuable contribution by the corrective measures. [Table 2 caption fragment: "... after applying corrective measures to base chunker combination."]</Paragraph> <Paragraph position="8"> The choice of WPDV proves a good one, as the WPDV algorithm is able to cope well with all the modeling tasks in the system. Whether it is the best choice can only be determined by future experiments, using other machine learning techniques in the same architecture.</Paragraph> </Section> </Paper>