<?xml version="1.0" standalone="yes"?> <Paper uid="I05-6006"> <Title>The Syntactically Annotated ICE Corpus and the Automatic Induction of a Formal Grammar</Title> <Section position="3" start_page="51" end_page="99" type="metho"> <SectionTitle> 3 The coverage of the formal grammar </SectionTitle> <Paragraph position="0"> The coverage of the formal grammar is evaluated separately for the rule sets of the five canonical phrase types. The coverage of the clausal rules is also reported towards the end of this section.</Paragraph> <Section position="1" start_page="51" end_page="99" type="sub_section"> <SectionTitle> 3.1 The coverage of AJP rules </SectionTitle> <Paragraph position="0"> As Figure 4 suggests, the coverage of the grammar, when tested with the five samples, is consistently high: all above 99% even when the grammar was trained on only one ninth of the training set. Increasing the size of the training set does not significantly enhance the coverage.</Paragraph> </Section> <Section position="2" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 3.2 The coverage of AVP rules </SectionTitle> <Paragraph position="0"> As with the AJP rules, high coverage can be achieved with a small training set: when trained on only one ninth of the training data, the AVP rules already showed a high coverage of above 99.4%, quickly approaching 100%. See Figure 5.</Paragraph> </Section> <Section position="3" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 3.3 The coverage of NP rules </SectionTitle> <Paragraph position="0"> Although lower than that of the AVP and AJP rules discussed above, the coverage of the NP rules is still satisfactorily high when tested on the five samples. As can be seen from Figure 6, the initial coverage when trained on one ninth of the training data is generally above 97%, rising steadily to about 99% as the training data size increases.</Paragraph> <Paragraph position="1"> This suggests that NP structures are only mildly complex.</Paragraph> </Section> <Section position="4" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 3.4 The coverage of VP rules </SectionTitle> <Paragraph position="0"> VPs do not seem to pose a significant challenge to the parser. As Figure 7 indicates, the initial coverage is above 97.5% for all sets; Set 1 even achieved a coverage of over 98.5% when the grammar was trained on only one ninth of the training data. As the graph suggests, the learning curve reaches a plateau when about half of the total training data is used, suggesting that a relatively small, central set of rules accounts for most VP structures.</Paragraph> </Section> <Section position="5" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 3.5 The coverage of PP rules </SectionTitle> <Paragraph position="0"> As Figure 8 shows, PPs are perhaps the most complex of the five phrase types, with an initial coverage of just over 70%. The learning curve is steep, culminating between 85% and 90% with the full training data set. As far as parser construction is concerned, this phrase type alone deserves special attention, since it accounts for much of the structural complexity of the clause. Based on this observation, a separate study was carried out to automatically identify the syntactic functions of PPs.</Paragraph>
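Coverage throughout this section can be read as the proportion of constituents in the held-out samples whose expansion is licensed by a rule induced from the given fraction of the training treebank. The following is a minimal sketch of such a coverage experiment; the tree representation and the flat rule extraction are illustrative assumptions rather than the authors' implementation.

```python
import random

def constituents(tree):
    """Yield (mother, daughters) for every constituent of a tree given as
    (category, [children]); childless nodes are leaves, assumed here to be
    wordclass tags."""
    cat, children = tree
    if children:
        yield cat, tuple(child[0] for child in children)
        for child in children:
            yield from constituents(child)

def induce_rules(trees):
    """Read off flat phrase-structure rules from every constituent."""
    return {rule for tree in trees for rule in constituents(tree)}

def coverage(train_trees, test_trees, fraction, seed=0):
    """Induce rules from a random fraction of the training trees and return
    the proportion of test constituents whose expansion they cover."""
    random.seed(seed)
    sample = random.sample(train_trees, max(1, int(len(train_trees) * fraction)))
    rules = induce_rules(sample)
    instances = [rule for tree in test_trees for rule in constituents(tree)]
    return sum(rule in rules for rule in instances) / len(instances)

# A toy NP, "the very old man": NP -> ART AJP N, AJP -> ADV ADJ
np = ("NP", [("ART", []), ("AJP", [("ADV", []), ("ADJ", [])]), ("N", [])])
print(induce_rules([np]))  # two rules: ('NP', ('ART', 'AJP', 'N')) and ('AJP', ('ADV', 'ADJ'))
```

Running coverage for fractions 1/9, 2/9, ..., 9/9 of the training set, separately per phrase type, produces learning curves of the kind summarised in Figures 4 to 9.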
</Section> <Section position="6" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 3.6 The coverage of clausal rules </SectionTitle> <Paragraph position="0"> Clausal rules present the most challenging problem: as Figure 9 clearly indicates, their coverage remains under 67% even when the grammar is trained on all of the training data. This observation reaffirms the usefulness of the rules at phrase level and the inadequacy of the clause structure rules. Indeed, it is intuitively clear that the complexity of the sentence results mainly from the combination of clauses of various kinds.</Paragraph> </Section> <Section position="7" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 3.7 Discussion </SectionTitle> <Paragraph position="0"> This section presented an evaluation of the grammar in terms of its coverage as a function of growing training data size. As shown, the parsed corpus yielded excellent grammar sets for the canonical phrase types AJP, AVP, NP, PP, and VP: except for PPs, all the phrase structure rule sets achieved a wide coverage of about 99%.</Paragraph> <Paragraph position="1"> The more varied set for PPs demonstrated a coverage of nearly 90%, not as high as that achieved for the other phrase types but still highly satisfactory.</Paragraph> <Paragraph position="2"> The coverage of the clause structure rules, on the other hand, was considerably poorer than that of the phrase rules. When all of the training data was used, these rules covered just over 65% of the test data.</Paragraph> <Paragraph position="3"> In view of these empirical observations, it can be reliably concluded that corpus-based grammar construction is a promising approach, in that the phrase structure rules generally have high coverage when tested on unseen data. The same approach also raises two questions at this stage: Does the high-coverage grammar also demonstrate a high precision of analysis? Is it possible to enhance the coverage of the clause structure rules within the current framework?
4 Evaluating the accuracy of analysis
The ICE project used two major annotation tools: AUTASYS and the Survey Parser.</Paragraph> <Paragraph position="4"> AUTASYS is an automatic wordclass tagging system that applies the ICE tags to words in the input text with an accuracy rate of about 94% (Fang 1996a). The tagged text is then fed into the Survey Parser for automated syntactic analysis. The parsing model is one that tries to identify an analogy between the input string and a sentence that is already syntactically analysed and stored in a database (Fang 1996b, 2000).</Paragraph> <Paragraph position="5"> This parser is driven by the previously described formal grammar for both phrasal and clausal analysis. In this section, the formal grammar is characterised through an empirical evaluation of the accuracy of analysis by the Survey Parser.</Paragraph> </Section> <Section position="8" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 4.1 The NIST evaluation scheme </SectionTitle> <Paragraph position="0"> The National Institute of Standards and Technology (NIST) proposed an evaluation scheme that looks at the following properties when comparing recognition results with the correct answer:</Paragraph>
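In the notation used in the rest of this section, let N be the number of constituents in the correct tree, C of which are correctly recognised, S substituted and D deleted, and let I be the number of constituents inserted by the parser. The properties in question are then, presumably, the standard NIST string-alignment rates; the formulation below is reconstructed to be consistent with the worked example in Section 4.2 rather than quoted from the scheme itself.

```latex
\[
\text{correct match} = \frac{C}{N}, \qquad
\text{substitution}  = \frac{S}{N}, \qquad
\text{deletion}      = \frac{D}{N}, \qquad
\text{insertion}     = \frac{I}{N},
\]
\[
\text{accuracy} = \frac{N - S - D - I}{N} = \frac{C - I}{N},
\qquad \text{where } N = C + S + D .
\]
```

With the counts reported for example [12] in Section 4.2 (N = 42, C = 37, S = 5, D = 0, I = 6), these give a correct match rate of 37/42 = 88.1% and an overall accuracy of (42 - 5 - 0 - 6)/42 = 73.8%, matching the figures quoted there.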
<Paragraph position="2"> Notably, the correct match rate is identical to the labelled or bracketed recall rate. The commonly used precision score is calculated as the total number of correct nodes over the sum of correct, substituted, and inserted nodes. The insertion rate, arguably, subsumes crossing-brackets errors, since crossing brackets are caused by the insertion of constituents, even though not every insertion causes a crossing-brackets violation. In this respect, the crossing-brackets score only implicitly hints at the insertion problem, while the insertion rate of the NIST scheme addresses it explicitly.</Paragraph> <Paragraph position="3"> Because of these considerations, the evaluations reported in the next section were conducted using the NIST scheme. To present the two sides of the same coin objectively, the scheme was used to evaluate the Survey Parser in terms of constituent labelling and constituent bracketing before the two were combined to yield overall performance scores.</Paragraph> <Paragraph position="4"> In order to conduct a precise evaluation of the performance of the parser, the experiments look at two aspects of the parse tree: labelling accuracy and bracketing accuracy.</Paragraph> <Paragraph position="5"> Labelling accuracy expresses how many correctly labelled constituents there are per hundred constituents and is intended to measure how well the parser labels the constituents compared with the correct tree. Bracketing accuracy attempts to measure the similarity of the parser-produced tree to the correct one by expressing how many correctly bracketed constituents there are per hundred constituents. In this section, the NIST metric scheme is applied to the two properties separately before an attempt is made to combine them to assess the overall performance of the Survey Parser.</Paragraph> <Paragraph position="6"> The same set of test data described in the previous section was used to create four test sets of 1,000 trees each to evaluate the performance of the grammar induced from the training sets described earlier.</Paragraph> </Section> <Section position="9" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 4.2 Labelling Accuracy </SectionTitle> <Paragraph position="0"> To evaluate labelling accuracy with the NIST scheme, the labelled constituents are viewed as a linear string with the attachment bracketing removed. For example [12], Figure 10 shows the correct tree and Figure 11 the parser-produced tree.</Paragraph> <Paragraph position="1"> [12] It was probably used in the Southern States as well.</Paragraph> <Paragraph position="2"> Figure 10: A correct tree for [12]
After removing the bracketed structure, we obtain two flattened sequences of constituent labels and compare them using the NIST scheme. For this example, there are 42 constituent labels in the correct tree, of which 37 (88.1%) are correctly labelled by the parser, with 5 substitutions (11.9%), 0 deletions, and 6 insertions. The overall labelling accuracy is then calculated as 73.8%.</Paragraph> <Paragraph position="3"> A total of 4,000 trees, divided into four sets of 1,000 each, were selected from the test data to evaluate the labelling accuracy of the parser. Empirical results show that the parser achieved an overall labelling accuracy of over 80%.</Paragraph> <Paragraph position="4"> The parser scored 86% or better in terms of correct match (labelled recall) and nearly 84% in terms of labelled precision for the four sets. About 10% of the constituent labels are wrong (Subs), with a deletion rate (Del) of about 3.5%. Counting insertions (Ins), the overall labelling accuracy of the parser is around 80%.</Paragraph>
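As a concrete illustration of this comparison, the sketch below aligns two flattened label sequences and tallies NIST-style counts. The use of Python's difflib is a stand-in for the actual scoring program, whose alignment may differ in detail, so the counts it produces are only indicative.

```python
from difflib import SequenceMatcher

def nist_label_counts(gold, test):
    """Align two flattened constituent-label sequences and tally NIST-style
    counts: correct (C), substituted (S), deleted (D) and inserted (I)."""
    c = s = d = i = 0
    for op, g1, g2, t1, t2 in SequenceMatcher(a=gold, b=test,
                                              autojunk=False).get_opcodes():
        if op == "equal":
            c += g2 - g1
        elif op == "replace":          # aligned but relabelled constituents
            k = min(g2 - g1, t2 - t1)
            s += k
            d += (g2 - g1) - k         # unmatched gold labels count as deletions
            i += (t2 - t1) - k         # unmatched test labels count as insertions
        elif op == "delete":
            d += g2 - g1
        elif op == "insert":
            i += t2 - t1
    n = c + s + d                      # number of labels in the correct tree
    return {"N": n, "C": c, "S": s, "D": d, "I": i,
            "accuracy": (n - s - d - i) / n}
```

For [12], the counts reported above (N = 42, C = 37, S = 5, D = 0, I = 6) yield an accuracy of (42 - 5 - 0 - 6) / 42, i.e. 73.8%.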
</Section> <Section position="10" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 4.3 Bracketing Accuracy </SectionTitle> <Paragraph position="0"> A second aspect of the evaluation of the grammar through the Survey Parser involves measuring its attachment precision, an attempt to characterise the similarity of the parser-produced hierarchical structure to that of the correct parse tree. To estimate the precision of constituent attachment, a linear representation of the hierarchical structure of the parse tree was designed which ensures that a wrongly attached non-terminal node is penalised only once if its sister and daughter nodes are correctly aligned.</Paragraph> <Paragraph position="1"> Table 2 shows that the parser achieved nearly 86% for the bracketed correct match and 82.8% for bracketing precision. Considering insertions and deletions, the overall accuracy according to the NIST scheme is about 77%.</Paragraph> <Paragraph position="2"> This indicates that for every 100 bracket pairs, 77 are correct and 23 are substituted, deleted, or inserted. In other words, for a tree of 100 constituents, 23 edits are needed to make it conform to the correct tree structure.</Paragraph> </Section> <Section position="11" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 4.4 Combined accuracy </SectionTitle> <Paragraph position="0"> The combined score for labelling and bracketing accuracy is obtained by representing both constituent labelling and unlabelled bracketing in the linear string format described in the previous sections.</Paragraph> <Paragraph position="1"> Table 3 gives the total number of trees in the four test sets and the total number of constituents. The numbers of correct matches, substitutions, insertions, and deletions are indicated, and combined scores are computed accordingly. The table shows that the parser scored 86% and 83.5% respectively for labelled recall and precision, and achieved an overall performance of about 79%. Considering that the scoring program tends to underestimate the success rate, it is reasonable to assume a real overall combined performance of around 80%.</Paragraph>
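To make the bracketing comparison concrete, the sketch below reduces a tree to its constituent spans over the token string; comparing gold and parser spans with the same NIST-style counts as above then gives bracketed correct matches, substitutions, deletions, and insertions. This is a plain span extraction under the same assumed tree format as in the earlier sketches, not the linear representation designed for the Survey Parser, which additionally guarantees that a wrong attachment is penalised only once.

```python
def spans(tree, start=0, labelled=False):
    """Return (spans, width) for a tree given as (category, [children]);
    childless nodes are leaves and occupy one token position each."""
    cat, children = tree
    if not children:                       # a leaf: one token, no bracket
        return [], 1
    result, width = [], 0
    for child in children:
        child_spans, child_width = spans(child, start + width, labelled)
        result.extend(child_spans)
        width += child_width
    item = (cat, start, start + width) if labelled else (start, start + width)
    result.append(item)
    return result, width

# A toy NP: NP -> ART AJP N, AJP -> ADJ, over three tokens.
np = ("NP", [("ART", []), ("AJP", [("ADJ", [])]), ("N", [])])
print(spans(np)[0])                        # [(1, 2), (0, 3)]
```

With labelled=True the same routine yields labelled brackets, roughly the combination of constituent labelling and bracketing that underlies the combined score in Section 4.4.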
</Section> <Section position="12" start_page="99" end_page="99" type="sub_section"> <SectionTitle> 4.5 Discussion </SectionTitle> <Paragraph position="0"> Although the scores for the grammar and the parser look both encouraging and promising, it is difficult to draw straightforward comparisons with other systems. Charniak (2000) reports a maximum-entropy-inspired parser that scored 90.1% average precision/recall when trained and tested on sentences from the Wall Street Journal corpus (WSJ). While the difference in precision/recall between the two parsers may indicate a difference in performance between the two parsing approaches, two issues nevertheless remain to be investigated. Firstly, there is the issue of how text types may influence the performance of the grammar and indeed of the parsing system as a whole.</Paragraph> <Paragraph position="1"> Charniak (2000) uses the WSJ as both training and testing data, and it is reasonable to expect a fairly good overlap in terms of lexical co-occurrences and linguistic structures, and hence good performance scores. Indeed, Gildea (2001) suggests that the standard WSJ task is simplified by its homogeneous style. It thus remains to be verified how well the same system would perform when trained and tested on a more 'balanced' corpus such as ICE. Secondly, it is not clear how Charniak's parsing model would perform when dealing with a much more complex grammar such as that of ICE, which has almost three times as many non-terminal parsing symbols. The performance of the Survey Parser is very close to that of the unlexicalised PCFG parser reported in Klein and Manning (2003), but again the WSJ was used for training and testing and it is not clear how well their system would scale up to a typologically more varied corpus.</Paragraph> </Section> </Section> </Paper>