<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1034"> <Title>XML-Based Data Preparation for Robust Deep Parsing</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation and Future Research </SectionTitle> <Paragraph position="0"> With a corpus such as OHSUMED where there is no gold-standard tagged or hand-parsed subpart, it is hard to reliably evaluate our system. However, we did an experiment on 200 sentences taken at random from the corpus (average sentence length: 21 words). We ran three versions of our pre-processor over the 200 sentences to produce three different input files for the parser and for each input we counted the sentences which were assigned at least one parse. All three versions started from the same basic XML annotated data, where words were tagged by both taggers and parenthesised material was removed. Version 1 converted from this format to ANLT input simply by discarding the mark-up and separating off punctuation. Version 2 was the same except that content word POS tags were retained. Version 3 was put through our full pipeline which recognises formulae, numbers etc. and which corrects some tagging errors. The following table shows the number of sentences successfully parsed with each of the three inputs:

    Version 1:  4 parses (2%)
    Version 2: 32 parses (16%)
    Version 3: 79 parses (39.5%)

The extremely low success rate of Version 1 is a reflection of the fact that the ANLT lexicon does not contain any specialist lexical items. In fact, of the 200 sentences, 188 contained words that were not in the lexicon, and of the 12 that remained, 4 were successfully parsed. 
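For concreteness, the first two conversion regimes can be sketched roughly as follows. This is a minimal illustration only: the `(word, tag)` input representation, the content-word tag set, and the `word_TAG` output convention are assumptions for the sketch, not the actual ANLT input format.

```python
import re

# Assumed content-word tag set (illustrative; not the taggers' actual tag sets).
CONTENT_TAGS = {"NN", "NNS", "JJ", "VB", "VBD", "VBN", "RB"}

def to_parser_input(tagged_sentence, version):
    """Sketch of Versions 1 and 2 of the pre-processor: convert a list of
    (word, tag) pairs into a whitespace-delimited parser input line.
    Version 1 discards all mark-up; Version 2 retains content-word POS tags.
    (Version 3, the full pipeline, would additionally recognise formulae and
    numbers and correct some tagging errors; it is not sketched here.)"""
    tokens = []
    for word, tag in tagged_sentence:
        # Separate off trailing punctuation (done in all versions).
        core, punct = re.match(r"(.+?)([.,;:]?)$", word).groups()
        if version == 2 and tag in CONTENT_TAGS:
            tokens.append(f"{core}_{tag}")  # keep the POS tag for lexical look-up
        else:
            tokens.append(core)
        if punct:
            tokens.append(punct)
    return " ".join(tokens)

sent = [("blood", "NN"), ("pressure", "NN"), ("rose", "VBD"), ("rapidly.", "RB")]
print(to_parser_input(sent, 1))  # blood pressure rose rapidly .
print(to_parser_input(sent, 2))  # blood_NN pressure_NN rose_VBD rapidly_RB .
```

The point of retaining tags in Version 2 is that a specialist word unknown to the ANLT lexicon can still be looked up by category, which accounts for the jump from 2% to 16% coverage.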
The figure for Version 2 gives a crude measure of the contribution made by using tags in lexical look-up, and the figure for Version 3 shows the further gains from the full set of pre-processing techniques.</Paragraph> <Paragraph position="1"> Although we have achieved an encouraging overall improvement in performance, the total of 39.5% for Version 3 does not precisely reflect the accuracy of the parser. To determine accuracy, we hand-examined the parser output for the 79 sentences that were parsed and recorded whether or not the correct parse was among the parses found. Of these 79 sentences, 61 (77.2%) were parsed correctly and 18 (22.8%) were not, giving an overall accuracy of 30.5% for Version 3. While this figure is rather low for a practical application, it is worth reiterating that it still means that nearly one in three sentences is not only parsed correctly but also assigned a logical form. We are confident that the further work outlined below will improve performance sufficiently to yield a useful semantic analysis of a significant proportion of the corpus. Furthermore, in the case of the 18 sentences that were parsed incorrectly, it is important to note that the 'wrong' parses may still be capable of yielding useful semantic information.</Paragraph> <Paragraph position="2"> For example, the grammar's compounding rules do not yet allow coordination within compounds, so the NP the MS and direct blood pressure methods can only be parsed, wrongly, as a coordination of two NPs. The rest of the sentence in which this NP occurs is, however, parsed correctly.</Paragraph> <Paragraph position="3"> An analysis of the 18 incorrectly parsed sentences reveals that the causes of failure are distributed evenly across three categories: a word was mistagged and not corrected during pre-processing (6 sentences); the segmentation into tokens was inadequate (5); and the grammar lacked coverage (7). 
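The coverage and accuracy figures quoted above are straightforward ratios over the counts reported in the text, and can be reproduced directly (plain arithmetic; no assumptions beyond those counts):

```python
TOTAL = 200                    # randomly sampled OHSUMED sentences
PARSED = {1: 4, 2: 32, 3: 79}  # sentences receiving at least one parse, per version
CORRECT = 61                   # Version 3 parses judged correct by hand

# Coverage: fraction of the sample receiving at least one parse.
coverage = {v: 100 * n / TOTAL for v, n in PARSED.items()}
print(coverage)  # {1: 2.0, 2: 16.0, 3: 39.5}

# Accuracy among parsed sentences, and overall accuracy on the full sample.
accuracy_among_parsed = 100 * CORRECT / PARSED[3]
overall_accuracy = 100 * CORRECT / TOTAL
print(round(accuracy_among_parsed, 1))  # 77.2
print(round(overall_accuracy, 1))       # 30.5
```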
A casual inspection of a random sample of 10 of the sentences that failed to parse at all reveals a similar pattern, although several of these sentences failed for more than one reason. Lack of grammatical coverage was more in evidence, perhaps unsurprisingly, since the grammar has not yet been tuned to the domain.</Paragraph> <Paragraph position="4"> Although we are currently able to parse only between 30 and 40 percent of the corpus, we expect to improve on that figure considerably through continued development of the pre-processing component. Moreover, we have not yet incorporated any domain-specific lexical knowledge from, for example, UMLS, which we would expect to contribute to improved performance. Furthermore, our current level of success has been achieved without significant changes to the original grammar; once we start to tailor the grammar to the domain, we expect further significant gains in performance. As a final stage, we may find it useful to follow Kasper et al. (1999) and adopt a 'fallback' strategy for failed parses, in which the best partial analyses are assembled in a robust processing phase.</Paragraph> </Section> </Paper>