<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1103">
<Title>CLAWS4: THE TAGGING OF THE BRITISH NATIONAL CORPUS</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 INTRODUCTION </SectionTitle>
<Paragraph position="0"> The main purpose of this paper is to describe the CLAWS4 general-purpose grammatical tagger, used for the tagging of the 100-million-word British National Corpus, of which c.70 million words have been tagged at the time of writing (April 1994).1 We will emphasise the goals of (a) general-purpose adaptability, (b) incorporation of linguistic knowledge to improve quality and consistency, and (c) accuracy, measured consistently and in a linguistically informed way.</Paragraph>
<Paragraph position="1"> The British National Corpus (BNC) consists of c.100 million words of English written texts and spoken transcriptions, sampled from a comprehensive range of text types. The BNC includes 10 million words of spoken language, c.45% of which is impromptu conversation (see Crowdy, forthcoming). It also includes an immense variety of written texts, including unpublished materials. The grammatical tagging of the corpus has therefore required the 'super-robustness' of a tagger which can adapt well to virtually all kinds of text. The tagger has also had to be versatile in dealing with different tagsets (sets of grammatical category labels; see 3 below) and in accepting text in varied input formats. For the purposes of the BNC, the tagger has been required both to accept and to output text in a corpus-oriented TEI-conformant mark-up definition known as CDIF (Corpus Document Interchange Format), but within this format many variant formats (affecting, for example, segmentation into words and sentences) can be readily accepted. 1 The BNC is the result of a collaboration, supported by the Science and Engineering Research Council (SERC Grant No. GR/F99847) and the UK Department of Trade and Industry, between Oxford University Press (lead partner), Longman Group Ltd., Chambers Harrap, Oxford University Computer Services, the British Library and Lancaster University. We thank Elizabeth Eyes, Nick Smith, and Andrew Wilson for their valuable help in the preparation of this paper.</Paragraph>
<Paragraph position="2"> In addition, CLAWS allows variable output formats: for the current tagger, these include (a) a vertically-presented format suitable for manual editing, and (b) a more compact horizontally-presented format often more suitable for end-users. Alternative output formats are also allowed with (c) so-called 'portmanteau tags', i.e. combinations of two alternative tags, where the tagger calculates there is insufficient evidence for safe disambiguation, and (d) with simplified 'plain text' mark-up for the human reader. (See Tables 1 and 2 for examples of output formats.) CLAWS4, the BNC tagger, 2 incorporates many features of adaptability such as the above. It also incorporates many refinements of linguistic analysis which have built up over 14 years, particularly in the construction and content of the idiom-tagging component (see 2 below). At the same time, there are still many improvements to be made: the claim that 'you can put together a tagger from scratch in a couple of months' (recently heard at a research conference) is, in our view, absurdly optimistic.</Paragraph>
</Section>
</Paper>