<?xml version="1.0" standalone="yes"?> <Paper uid="I05-6006"> <Title>The Syntactically Annotated ICE Corpus and the Automatic Induction of a Formal Grammar</Title> <Section position="4" start_page="99" end_page="99" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> This article described a corpus of contemporary English that is linguistically annotated at both the grammatical and syntactic levels. It then described a formal grammar that is automatically generated from the corpus and presented statistics outlining the learning curve of the grammar as a function of training data size. The coverage of the grammar was assessed through empirical tests. Finally, the article reported the use of the NIST evaluation metric to evaluate the grammar when applied by the Survey Parser to test sets totalling 4,000 trees.</Paragraph> <Paragraph position="1"> When the size of the grammar for each of the five canonical phrase types was examined as a function of training data size, the learning curves for AJP, AVP, and VP levelled off fairly rapidly as the training data grew. In contrast, NPs and PPs showed a sharp learning curve, which might have suggested insufficient coverage by the grammar for these two phrase types. Experiments showed, however, that the grammar still had satisfactory coverage for these two, with near-total coverage for the other three phrase types.</Paragraph> <Paragraph position="2"> The NIST scheme was used to evaluate the performance of the grammar when applied in the Survey Parser. An especially advantageous feature of the metric is that it calculates an overall parser performance rate that takes into account the total number of insertions in the parse tree, an important structural distortion factor when calculating the similarity between two trees. A total of 4,000 trees were used to evaluate the labelling and bracketing accuracies of the parse trees automatically produced by the parser. 
The LR rate is over 86% and the LP rate is about 84%. The bracketed recall is 85.8%, with a bracketed precision of 82.8%.</Paragraph> <Paragraph position="3"> Finally, an attempt was made to estimate a combined performance score for both labelling and bracketing accuracies. The combined recall is 86.1% and the combined precision is 83.5%.</Paragraph> <Paragraph position="4"> These results show encouraging performance by the grammar in terms of coverage and accuracy and therefore argue strongly for inducing formal grammars from linguistically annotated corpora.</Paragraph> <Paragraph position="5"> A future research topic is the enhancement of the recall rate for clausal rules, which now stands at just over 65%. It would also be of great benefit to the parsing community to verify the impact that the size of the grammar has on the performance of the parsing system and to use a typologically more balanced corpus than WSJ as a workbench for grammar/parser development.</Paragraph> </Section> </Paper>