<?xml version="1.0" standalone="yes"?> <Paper uid="I05-6006"> <Title>The Syntactically Annotated ICE Corpus and the Automatic Induction of a Formal Grammar</Title> <Section position="2" start_page="0" end_page="51" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The International Corpus of English (ICE) is a project that aims at the construction of a collection of corpora of English in countries and regions where English is used either as a first or as an official language (Greenbaum 1992). Each component corpus comprises one million words of both written and transcribed spoken samples, which are annotated at the grammatical and syntactic levels. The British component of the ICE corpus was used to automatically induce a large formal grammar, which was subsequently used in a robust parsing system. In what follows, this article first describes the annotation schemes for the corpus and then evaluates a formal grammar automatically induced from the corpus in terms of its potential coverage when tested against empirical data.</Paragraph> <Paragraph position="1"> Finally, the article evaluates the grammar through its application in a robust parsing system in terms of labelling and bracketing accuracy.</Paragraph> <Section position="1" start_page="0" end_page="49" type="sub_section"> <SectionTitle> 1.1 The ICE wordclass annotation scheme </SectionTitle> <Paragraph position="0"> There are altogether 22 head tags and 71 features in the ICE wordclass tagging scheme, resulting in about 270 grammatically possible combinations.</Paragraph> <Paragraph position="1"> Compared with 134 tags for LOB, 61 for the BNC, and 36 for the Penn Treebank, the ICE tagset is perhaps the most detailed of those used in automatic applications. The tags cover all the major English word classes and provide morphological, grammatical, collocational, and sometimes syntactic information. A typical ICE tag has two components: the head tag and its features, which spell out the grammatical properties of the associated word. For instance, N(com,sing) indicates that the lexical item associated with this tag is a common (com) singular (sing) noun (N).</Paragraph> <Paragraph position="2"> Tags that indicate phrasal collocations include PREP(phras) and ADV(phras), for prepositions (as in [1]) and adverbs (as in [2]) that are frequently used in collocation with certain verbs and adjectives: [1] Thus the dogs' behaviour had been changed because they associated the bell with the food.</Paragraph> <Paragraph position="3"> [2] I had been filming The Paras at the time, and Brian had had to come down to Wales with the records.</Paragraph> <Paragraph position="4"> Some tags, such as PROFM(so,cl) (pronominal so representing a clause as in [3]) and PRTCL(with) (particle with as in [4]), indicate the presence of a clause: so in [3] signals an abbreviated clause, while with in [4] signals a non-finite clause: [3] If so, I'll come and meet you at the station. [4] The number by the arrows represents the order of the pathway causing emotion, with the cortex lastly having the emotion.</Paragraph> <Paragraph position="5"> Examples [5]-[7] illustrate tags that mark special sentence structures. There in [5] is tagged as EXTHERE, existential there indicating a marked sentence order. [6] is an example of the cleft sentence (which explicitly marks the focus), where it is tagged as CLEFTIT. [7] exemplifies anticipatory it, which is tagged as ANTIT: [5] There were two reasons for the secrecy.</Paragraph> <Paragraph position="6"> [6] It is from this point onwards that Roman Britain ceases to exist and the history of sub-Roman Britain begins.</Paragraph> <Paragraph position="7"> [7] Before trying to answer the question it is worthwhile highlighting briefly some of the differences between current historians.</Paragraph>
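<Paragraph> To make the tag format described above concrete, the following minimal sketch (written in Python purely for illustration; the function name and code are ours, not part of the ICE tools) splits a wordclass tag such as N(com,sing) into its head tag and feature list:

def split_ice_tag(tag):
    """Return (head, features) for an ICE tag such as 'N(com,sing)'."""
    if "(" not in tag:                    # bare head tag, e.g. EXTHERE
        return tag, []
    head, _, rest = tag.partition("(")
    features = rest.rstrip(")").split(",")
    return head, features

# Tags drawn from the examples above.
split_ice_tag("N(com,sing)")     # ('N', ['com', 'sing'])
split_ice_tag("PROFM(so,cl)")    # ('PROFM', ['so', 'cl'])
split_ice_tag("EXTHERE")         # ('EXTHERE', [])
</Paragraph>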
<Paragraph position="8"> The verb class is divided into auxiliaries and lexical verbs. The auxiliary class notes modals, perfect auxiliaries, passive auxiliaries, semi-auxiliaries, and semip-auxiliaries (those followed by -ing verbs). The lexical verbs are further annotated according to their complementation type: complex transitive, complex ditransitive, copular, dimonotransitive, ditransitive, intransitive, monotransitive, and TRANS. Figure 1 shows the subcategorisation of the verb class.</Paragraph> <Paragraph position="9"> The TRANS notation for the transitive verb class is used in the ICE project to tag those transitive verbs followed by a noun phrase that may be the subject of the following non-finite clause. Depending on the tests applied, this type of verb can be analysed as, for instance, monotransitive, ditransitive, or complex transitive. To avoid such arbitrary decisions, the complementing non-finite clause is assigned the catch-all label 'transitive complement' in parsing, and the preceding verb is accordingly tagged as TRANS so that no commitment is made to its transitivity type. This verb type is best demonstrated by [8]-[11]: [8] Just before Christmas, the producer of Going Places, Irene Mallis, had asked me to make a documentary on 'warm-up men'.</Paragraph> <Paragraph position="10"> [9] They make others feel guilty and isolate them.</Paragraph> <Paragraph position="11"> [10] I can buy batteries for the tape - but I can see myself spending a fortune! [11] The person who booked me in had his eyebrows shaved and replaced by straight black painted lines and he had earrings, not only in his ears but through his nose and lip! In examples [8]-[11], asked, make, see, and had are all complemented by non-finite clauses with overt subjects, the main verbs of these non-finite clauses being infinitive, present participle, and past participle forms.</Paragraph> <Paragraph position="12"> As illustrated by examples [1]-[11], the ICE tagging scheme goes well beyond the wordclass to provide some syntactic information and has thus proved itself an expressive and powerful means of pre-processing for subsequent parsing.</Paragraph> </Section> <Section position="2" start_page="49" end_page="51" type="sub_section"> <SectionTitle> 1.2 The ICE parsing scheme </SectionTitle> <Paragraph position="0"> The ICE parsing scheme recognises five basic syntactic phrases: the adjective phrase (AJP), adverb phrase (AVP), noun phrase (NP), prepositional phrase (PP), and verb phrase (VP).</Paragraph> <Paragraph position="1"> Each tree in the ICE parsing scheme is represented as a functionally labelled hierarchy, with features describing the characteristics of each constituent, which is represented as a pair of function-category labels. In the case of a terminal node, the function-category labels are followed by the lexical item itself in curly brackets. Figure 2 shows such a structure for [12].</Paragraph> <Paragraph position="2"> [12] We will be introducing new exam systems for both schools and universities.</Paragraph> <Paragraph position="3"> According to Figure 2, [12] is a parsing unit (PU) realised by a clause (CL), which governs three daughter nodes: SU NP (NP as subject), VB VP (VP as verbal), and OD NP (NP as direct object). Each of the three daughter nodes is sub-branched down to the leaf nodes, which carry the input tokens in curly brackets. The direct object node, for example, has three immediate constituents: NPPR AJP (AJP as NP premodifier), NPHD N(com,plu) (plural common noun as the NP head), and NPPO PP (PP as NP postmodifier).</Paragraph> <Paragraph position="4"> Note that in the same example, the head of the NP complementing the prepositional phrase is initially analysed as a coordinated construct (COORD), with two plural nouns as the conjoins (CJ) and a coordinating conjunction as the coordinator (COOR).</Paragraph>
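<Paragraph> For readers who prefer a concrete data structure, the hierarchy described above can be sketched as nested function-category nodes. The Python encoding below is ours and purely illustrative (it is not the ICE-GB file format); it spells out only the nodes of Figure 2 that are mentioned in the text, and terminal nodes would additionally carry the lexical item in curly brackets:

def node(function, category, features=None, daughters=None):
    return {
        "function": function,         # e.g. SU, VB, OD, NPHD
        "category": category,         # e.g. NP, VP, AJP, PP, N
        "features": features or [],   # e.g. ["com", "plu"]
        "daughters": daughters or [],
    }

tree_12 = node("PU", "CL", daughters=[
    node("SU", "NP"),                        # subject: "We"
    node("VB", "VP"),                        # verbal: "will be introducing"
    node("OD", "NP", daughters=[
        node("NPPR", "AJP"),                 # NP premodifier
        node("NPHD", "N", ["com", "plu"]),   # plural common noun head
        node("NPPO", "PP"),                  # NP postmodifier
    ]),
])
</Paragraph>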
<Paragraph position="5"> In all, there are 58 non-terminal parsing symbols in the ICE parsing scheme, compared with 20 defined in the Penn Treebank project. The SUSANNE Treebank has 43 function/category symbols, discounting those that are represented as features in the ICE system.</Paragraph> <SectionTitle> 2 The generation of a formal grammar </SectionTitle> <Paragraph position="6"> The British component of the ICE corpus, annotated in the fashion described above, has been used to automatically generate a formal grammar that has subsequently been applied in an automatic parsing system to annotate the rest of the corpus (Fang 1995, 1996, 1999). The grammar consists of two sets of rules. The first set describes the five canonical phrases (AJP, AVP, NP, PP, VP) as sequences of grammatical tags terminating at the head of the phrase. For example, the sequence AUX(modal,pres) AUX(prog,infin) V(montr,ingp) is a VP rule describing instantiations such as will be introducing in [12]. The second set describes the clause as sequences of phrase types. The string in [12], for instance, is described by the sequence NP VP NP PP in the set of clausal rules.</Paragraph>
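<Paragraph> As a minimal sketch of what these two rule sets look like as data (the encoding below is ours, for illustration only, and not the representation used in the ICE project), phrase rules can be stored as tuples of wordclass tags ending at the phrase head, and clausal rules as tuples of phrase types:

# A VP rule covering "will be introducing" in example [12].
vp_rule = ("AUX(modal,pres)", "AUX(prog,infin)", "V(montr,ingp)")

# The clausal rule describing the phrase sequence of [12].
clause_rule = ("NP", "VP", "NP", "PP")

# Induction then amounts to collecting the distinct sequences observed
# in the parsed training texts.
def collect_rules(observed_sequences):
    """Return the set of distinct rules seen in the training data."""
    return set(tuple(seq) for seq in observed_sequences)

vp_grammar = collect_rules([vp_rule])          # toy call with one observation
clause_grammar = collect_rules([clause_rule])
</Paragraph>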
<Paragraph position="7"> To empirically characterise the grammar, the syntactically parsed ICE corpus was divided into ten equal parts according to the number of component texts. One part was set aside for testing and was further divided into five test sets. The remaining nine parts were used as training data in a leave-one-out fashion: they were used to generate nine consecutive training sets, each increased by one part over the previous one, with Set 1 comprising one part, Set 2 two parts, Set 3 three parts, and so on. The evaluation thus aims not only to establish the potential coverage of the grammar but also to indicate how its coverage grows as a function of training data size.</Paragraph> <Paragraph position="8"> Figure 3 shows the growth of the number of phrase structure rules as a function of training data size. The Y-axis indicates the number of rules generated from the training data and the X-axis the gradual increase in training data size.</Paragraph> <Paragraph position="9"> It can be observed that AJP and AVP show only a marginal increase in the number of different rules as the training data grows, demonstrating a relatively small core set of rules. In comparison, VPs are more varied but still exhibit a visible plateau of growth. The other two phrases, NP and PP, show a much more varied set of rules, not only in their large numbers (9,184 for NPs and 13,736 for PPs) but also in their sharp learning curves. The potentially large set of rules for PPs is more or less expected, since PPs structurally subsume the clause as well as all the phrase types. The large set of NP rules is, however, somewhat surprising: NPs are often characterised, perhaps too simplistically, as comprising a determiner group, a premodifier group, and the noun head, and yet the grammar has 9,184 different rules for this phrase type. While this phenomenon calls for further investigation, the current article is concerned only with the issue of coverage.</Paragraph> </Section> </Section> </Paper>