<?xml version="1.0" standalone="yes"?>
<Paper uid="P88-1026">
  <Title>Lexicon and grammar in probabilistic tagging of written English.</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Text Corpora
</SectionTitle>
    <Paragraph position="0"> Historically, the use of text corpora to provide empirical data for testing grammatical theories has been regarded as important to varying degrees by philologists and linguists of differing persuasions. The use of corpus citations in grammars and dictionaries pre-dates electronic data processing (Brown,</Paragraph>
    <Paragraph position="1"> 1984: 34). While most of the generative grammarians of the 60s and 70s ignored corpus data, the increased power of the new technology eventually points the way to new applications of computerized text corpora in dictionary making, style checking and speech recognition. Computer corpora present the computational linguist with the diversity and complexity of real language, which is more challenging for testing language models than intuitively derived examples.</Paragraph>
    <Paragraph position="2"> Ultimately, grammars must be judged by their ability to contend with the real facts of language and not just basic constructs extrapolated by grammarians.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="211" type="metho">
    <SectionTitle>
2. Word Tagging
</SectionTitle>
    <Paragraph position="0"> The system devised for automatic word tagging or part of speech selection for processing running English text, known as the Constituent-Likelihood Automatic Word-tagging System (CLAWS) (Garside et al., 1987), serves as the basis for the current work. The word tagging system is an automated component of the probabilistic parsing system we are currently working on. In word tagging, each of the running words in the corpus text to be processed is associated with a pre-terminal symbol, denoting word class. In essence, the CLAWS suite can be conceptually divided into two phases: tag assignment and tag selection.</Paragraph>
    <Paragraph position="1"> JB = attributive adjective; JJ = general adjective; NN1 = singular common noun; NNS1 = noun of style or title; NP1 = singular proper noun; VV0 = base form of lexical verb; VVD = past tense of lexical verb; VVG = -ing form of lexical verb; VVN = past participle of lexical verb; %, @ = probability markers; :- = word initial capital marker.</Paragraph>
    <Paragraph position="2"> Tag assignment involves, for each input running word or punctuation mark, lexicon look-up, which provides one or more potential word tags for each input word or punctuation mark. The lexicon is a list of about 8,000 records containing fields for (1) the word form and (2) the set of one or more candidate tags denoting the word's word class(es), with probability markers attached indicating three levels of probability.</Paragraph>
    <Paragraph position="3"> Words not in the CLAWS lexicon are assigned potential tags either by suffixlist look-up, which attempts to match end characters of the input word with a suffix in the suffixlist, or, if the input word does not have a word-ending to match one of these entries, by default tag assignment. The procedures ensure that words and neologisms not in the lexicon are given an analysis.</Paragraph>
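    <Paragraph> The three-stage look-up described above (lexicon, then suffixlist, then default tags) can be sketched in Python as follows. This is a minimal illustration and not the CLAWS implementation: the names LEXICON, SUFFIXLIST, DEFAULT_TAGS and assign_tags, every entry in the tables, and the longest-suffix-first matching order are assumptions, and the probability markers are omitted.

```python
# Hypothetical miniature tables; all entries are invented for illustration.
LEXICON = {
    "of": ["IO"],
    "run": ["VV0", "NN1"],
    "running": ["VVG", "JJ", "NN1"],
}
SUFFIXLIST = {
    "ness": ["NN1"],
    "ing": ["VVG", "JJ", "NN1"],
    "ed": ["VVD", "VVN", "JJ"],
    "s": ["NN2"],
}
DEFAULT_TAGS = ["NN1", "VV0", "JJ"]


def assign_tags(word):
    """Return the candidate tags for one running word."""
    key = word.lower()
    # Stage 1: lexicon look-up.
    if key in LEXICON:
        return list(LEXICON[key])
    # Stage 2: suffixlist look-up, trying the longest suffix first (an assumption).
    for suffix in sorted(SUFFIXLIST, key=len, reverse=True):
        if key.endswith(suffix):
            return list(SUFFIXLIST[suffix])
    # Stage 3: default tags, so that every word receives some analysis.
    return list(DEFAULT_TAGS)


print(assign_tags("running"))   # found in the lexicon
print(assign_tags("kindness"))  # matched by the suffix "ness"
print(assign_tags("blorp"))     # falls through to the default tags
```

    Trying longer suffixes first keeps a word such as "kindness" from being captured by the shorter suffix "s"; the source does not say how the real suffixlist orders its entries.</Paragraph>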
    <Paragraph position="4"> Tag selection disambiguates the alternative tags that are assigned to some of the running words. Disambiguation is achieved by invoking one-step probabilities of tag pair likelihoods extracted from a previously tagged training corpus and upgrading or downgrading likelihoods according to the probability markers against word tags in the lexicon or suffixlist. In the majority of cases, this first order Markov model is sufficient to correctly select the most likely of the tags associated with the input running text. (Over 90 per cent of running words are correctly disambiguated in this way.) Exceptions are dealt with by invoking a look-up procedure that searches through a limited list of groups of two or more words, or by automatically adjusting the probabilities of sequences of three tags in cases where the intermediate tag is misleading.</Paragraph>
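    <Paragraph> Tag selection with one-step tag-pair probabilities can be sketched as a search for the most likely path through the candidate tags. The sketch below assumes a Viterbi-style dynamic programme, which the text does not confirm as the CLAWS search method; the function name select_tags, the floor value for unseen tag pairs and all probabilities are invented for illustration, and the lexicon probability markers, the word-group list and the three-tag adjustments are left out.

```python
def select_tags(candidates, transition, floor=1e-6):
    """Pick the most likely tag sequence under a first-order Markov model.

    candidates: one list of candidate tags per running word.
    transition: dict mapping (previous_tag, tag) to a probability.
    """
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {tag: (1.0, [tag]) for tag in candidates[0]}
    for tags in candidates[1:]:
        new_best = {}
        for tag in tags:
            # Extend every surviving path by this tag and keep the best one.
            score, path = max(
                (prob * transition.get((prev, tag), floor), prev_path)
                for prev, (prob, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]


# Invented transition probabilities for a three-word example.
transition = {
    ("II", "VVG"): 0.05, ("II", "JJ"): 0.2, ("II", "NN1"): 0.3,
    ("VVG", "NN1"): 0.3, ("JJ", "NN1"): 0.6, ("NN1", "NN1"): 0.1,
}
candidates = [["II"], ["VVG", "JJ", "NN1"], ["NN1"]]
print(select_tags(candidates, transition))  # ['II', 'JJ', 'NN1']
```

    In the example the middle word is ambiguous between three tags; the path through JJ wins because the product of its two transition probabilities (0.2 and 0.6) is the largest.</Paragraph>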
    <Paragraph position="5"> The current version of the CLAWS system requires no pre-editing and attributes the correct word tag to over 96 per cent of the input running words, leaving 3 to 4 per cent to be corrected by human post-editors.</Paragraph>
  </Section>
  <Section position="5" start_page="211" end_page="211" type="metho">
    <SectionTitle>
3. Error Analysis
</SectionTitle>
    <Paragraph position="0"> Error analysis of CLAWS output has resulted, and continues to result, in diverse improvements to the system, from the simple adjustment of probability weightings against tags in the lexicon to the inclusion of additional procedures, for instance to deal with the distinction between proper names [...] Parts of the system can also be used to develop new parts, to extend existing parts, or to interface with other systems. For instance, in order to produce a lexicon sufficiently large and detailed enough for parsing, we needed to extend the original list of about 8,000 entries to over 20,000 (the new CLAWS lexicon contains about 26,500 entries). In order to do this, a list of 15,000 words not already in the CLAWS lexicon was tagged using the CLAWS tag assignment program. (Since they were not already in the lexicon, the candidate tags for each new entry were assigned by suffixlist look-up or default tag assignment.) The new list was then post-edited by interactive screen editing and merged with the old lexicon.</Paragraph>
    <Paragraph position="1"> Another example of 'self improvement' is in the production of a better set of one-step transition probabilities. The first CLAWS system used a matrix of tag transition probabilities derived from the tagged Brown corpus (Francis and Kučera,</Paragraph>
    <Paragraph position="2"> 1982). Some cells of this matrix were inaccurate because of incompatibility of the Brown tagset and the CLAWS tagset. To remedy this, a new matrix was created by a statistics-gathering program that processed the post-edited version of a corpus of one million words tagged by the original CLAWS suite of programs.</Paragraph>
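    <Paragraph> A statistics-gathering pass of the kind described here can be sketched as a relative-frequency count over adjacent tag pairs. The function name transition_probabilities, the tiny corpus and the choice of plain relative frequencies without smoothing are assumptions made for illustration; the source does not describe how the actual program estimated its matrix.

```python
from collections import Counter


def transition_probabilities(tagged_sentences):
    """Estimate P(tag | previous tag) by relative frequency."""
    pair_counts = Counter()   # how often (previous tag, tag) occurs
    prev_counts = Counter()   # how often each tag occurs as a previous tag
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        for prev, tag in zip(tags, tags[1:]):
            pair_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


# A toy post-edited corpus; words and tags are invented for illustration.
corpus = [
    [("of", "IO"), ("old", "JJ"), ("men", "NN2")],
    [("of", "IO"), ("men", "NN2")],
    [("old", "JJ"), ("men", "NN2")],
]
matrix = transition_probabilities(corpus)
print(matrix[("IO", "JJ")])   # 0.5
print(matrix[("JJ", "NN2")])  # 1.0
```

    Because the counts come from text tagged with the system's own tagset, no mapping between tagsets is needed, which is the source of the inaccuracy the paragraph above describes.</Paragraph>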
  </Section>
  <Section position="6" start_page="211" end_page="212" type="metho">
    <SectionTitle>
4. Subcategorization
</SectionTitle>
    <Paragraph position="0"> Apart from extending the vocabulary coverage of the CLAWS lexicon, we are also subcategorizing words belonging to the major word classes in order to reduce the over-generation of alternative parses of sentences of greater than trivial length. The task of subcategorization involves: (1) a linguist's specification of a schema or typology of lexical subcategories based on distributional and functional criteria; (2) a lexicographer's judgement in assigning one or more of the subcategory codes in the linguist's schema to the major lexical word forms (verbs, nouns, adjectives). The amount of detail demarcated by the subcategorization typology is dependent, in part, on the practical requirements of the system. Existing subcategorization systems, such as the one provided in the Longman Dictionary of Contemporary English (1978) or Sager's (1981) subcategories, need to be taken into account. But these are assessed critically rather than adopted wholesale (see for instance Akkerman et al., 1985 and Boguraev et al., 1987, for a discussion of the strengths and weaknesses of the LDOCE grammar codes).</Paragraph>
    <Paragraph position="1"> [1] intransitive verb : ache, age, allow, care, conflict, escape, occur, reply, snow, stay, sun-bathe, swoon, talk, vanish. [2] transitive verb : abandon, abhor, allow, build, complete, contain, demand, exchange, get, give, house, keep, mail, master, oppose, pardon, spend, [...], warn.</Paragraph>
    <Paragraph position="2"> [3] copular verb : appear, become, feel, [...], grow, remain, seem.</Paragraph>
    <Paragraph position="3"> [4] prepositional verb : abstain, aim, ask, belong, cater, consist, prey, pry, search, vote.</Paragraph>
    <Paragraph position="4"> [5] phrasal verb : blow, build, cry, dress, ease, farm, fill, hand, jazz, look, open, pop, sham, work.</Paragraph>
    <Paragraph position="5"> [6] verb followed by that-clause : accept, believe, demand, doubt, feel, guess, know, [...], reckon, [...], think.</Paragraph>
    <Paragraph position="6"> [7] verb followed by to-infinitive : ask, come, dare, demand, fail, hope, intend, need, prefer, propose, refuse, seem, try, wish.</Paragraph>
    <Paragraph position="7"> [8] verb followed by -ing construction : abhor, begin, continue, deny, dislike, enjoy, keep, recall, remember, risk, suggest.</Paragraph>
    <Paragraph position="8"> [9] ambitransitive verb : accept, answer, close, compile, cook, develop, feed, fly, move, obey, [...], quit, sing, stop, teach, try.</Paragraph>
    <Paragraph position="9"> [A] verb habitually followed by an adverbial : appear, come, go, keep, lie, live, move, put, sit, stand, swim, veer. [W] verb followed by a wh-clause : ask, choose, doubt, imagine, know, matter, mind, wonder.</Paragraph>
    <Paragraph position="10"> We began subcategorization of the CLAWS lexicon by word-tagging the 3,000 most frequent words in the Brown corpus (Kučera and Francis, 1967). An initial system of eleven verb subcategories was proposed, and judgements about which subcategory(ies) each verb belonged to were empirically tested by looking up entries in the microfiche concordance of the tagged Lancaster/Oslo-Bergen corpus (Hofland and Johansson, 1982; Johansson et al., 1986), which shows every occurrence of a tagged word in the corpus together with its context. About 2,500 verbs have been coded in this way, and we are now working on a more detailed system of about 80 different verb subcategories using the Lexicon Development Environment of Boguraev et al. (1987).</Paragraph>
  </Section>
  <Section position="7" start_page="212" end_page="212" type="metho">
    <SectionTitle>
5. Constituent Analysis
</SectionTitle>
    <Paragraph position="0"> The task of implementing a probabilistic parsing algorithm to provide a disambiguated constituent analysis of unrestricted English is more demanding than implementing the word tagging suite, not least because, in order to operate in a manner similar to the word-tagging model, the system requires (1) specification of an appropriate grammar of rules and symbols and (2) the construction of a sufficiently large databank of parsed sentences conforming to the (optimal) grammar specified in (1) to provide statistics of the relative likelihoods of constituent tag transitions for constituent tag disambiguation.</Paragraph>
    <Paragraph position="1"> In order to meet these prior requirements, researchers have been employed on a full-time basis to assemble a corpus of parsed sentences.</Paragraph>
  </Section>
  <Section position="8" start_page="212" end_page="213" type="metho">
    <SectionTitle>
6. Grammar Development and Parsed
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="212" end_page="213" type="sub_section">
      <SectionTitle>
Subcorpora
</SectionTitle>
      <Paragraph position="0"> The databank of approximately 45,000 words of manually parsed sentences of the Lancaster/Oslo-Bergen corpus (Sampson, 1987: 83ff) was processed to show the distinct types of production rules and their frequency of occurrence in the grammar associated with the Sampson treebank.</Paragraph>
      <Paragraph position="1"> [...] of the UCREL probabilistic system (Garside and Leech, 1987: 66ff) and suggestions from other researchers prompting new rules resulted in a new context-free grammar of about 6,000 productions creating more steeply nested structures than those of the Sampson grammar. (It was anticipated that steeper nesting would reduce the size of the treebank required to obtain adequate frequency statistics.) The new grammar is defined descriptively in a Parser's Manual (Leech, 1987) and formalised as a set of context-free phrase-structure productions.</Paragraph>
      <Paragraph position="2"> Development of the grammar then proceeded in tandem with the construction of a second databank of parsed sentences, fitting, as closely as possible, the rules expressed by the grammar. The new databank comprises extracts from newspaper reports dating from 1979-80 in the Associated Press (AP) corpus. Any difficulties the grammarians had in parsing were resolved, where appropriate, by amending or adding rules to the grammar. This methodology resulted in the grammar [...] Ob = operator consisting of, or ending with, a form of be; Od = operator consisting of, or ending with, a form of do; Oh = operator consisting of, or ending with, a form of the verb have; V = main verb with complementation; V' = predicate;</Paragraph>
      <Paragraph position="4"> 7. Constructing the Parsed Databank
 For convenience of screen editing and computer processing, the constituent structures are represented in a linear form, as strings of grammatical words with labelled bracketing. The grammarians are given print-outs of post-edited output from the CLAWS suite. They then construct a constituent analysis for each sentence on the print-out, either in detail or in outline, according to the rules described in the Parser's Manual, and key in their structures using an input program that checks for well-formedness. The well-formedness constraints imposed by the program are: (1) that labels are legal non-terminal symbols; (2) that labelled brackets balance; (3) that the productions obtained by the constituent analysis are contained in the existing grammar.</Paragraph>
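      <Paragraph> The three well-formedness constraints can be sketched as a single pass over a bracketed string with a stack. The bracket notation used here (an opening token such as "[N", a closing token such as "N]", and words written as word_TAG), the function name check_parse, and the toy label set and grammar are all assumptions made for illustration; the source does not show the notation or the input program itself.

```python
def check_parse(text, nonterminals, grammar):
    """Return a list of error messages; an empty list means well-formed."""
    errors = []
    stack = []  # each entry: [label, child symbols collected so far]
    for token in text.split():
        if token.startswith("["):
            label = token[1:]
            # Constraint 1: labels must be legal non-terminal symbols.
            if label not in nonterminals:
                errors.append("illegal label: " + label)
            stack.append([label, []])
        elif token.endswith("]"):
            label = token[:-1]
            # Constraint 2: labelled brackets must balance.
            if not stack or stack[-1][0] != label:
                errors.append("unbalanced bracket: " + token)
                continue
            _, children = stack.pop()
            # Constraint 3: the production must be in the existing grammar.
            if (label, tuple(children)) not in grammar:
                errors.append(
                    "unknown production: %s -> %s" % (label, " ".join(children))
                )
            if stack:
                stack[-1][1].append(label)
        elif stack:
            # A word contributes its tag as a child of the open constituent.
            stack[-1][1].append(token.rsplit("_", 1)[-1])
    for label, _ in stack:
        errors.append("unclosed bracket: [" + label)
    return errors


NONTERMINALS = {"S", "N", "V"}
GRAMMAR = {("S", ("N", "V")), ("N", ("NN2",)), ("V", ("VV0",))}

print(check_parse("[S [N men_NN2 N] [V sing_VV0 V] S]", NONTERMINALS, GRAMMAR))
print(check_parse("[S [X men_NN2 X] S]", NONTERMINALS, GRAMMAR))
```

      The first call returns an empty list; the second reports the illegal label X and the productions that the toy grammar does not contain.</Paragraph>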
      <Paragraph position="5"> One sentence is presented at a time. Any errors found by the program are reported back to the screen once the grammarian has sent what s/he considers to be the completed parse. Sentences which are not well formed can be re-edited or abandoned. A validity marker is appended to the reference for each sentence, indicating whether the sentence has been abandoned with errors contained in it.</Paragraph>
      <Paragraph position="7"/>
      <Paragraph position="9"> [...] conjunction; IF = for as preposition; II = preposition; IO = of as preposition; MC = cardinal number; MD = ordinal number; NN2 = plural common noun; NNL2 = plural locative noun; NNT1 = temporal noun; NNU = unit of measurement; RR = general adverb; VBR = are; $ = germanic genitive marker.</Paragraph>
      <Paragraph position="10"> 8. Assessing the Parsed Databank and the Grammar
 We have written ancillary programs to help in the development of the grammar and to check the validity of the parses in the databank. One program searches through the parsed databank for every occurrence of a constituent matching a specified constituent tag. Output is a list of all occurrences of the specified constituent together with frequencies. This facility allows selective searching through the databank, which is a tool for revising parts of the grammar.</Paragraph>
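      <Paragraph> A search program of this kind can be sketched as a stack-based scan that collects every constituent carrying the requested label and counts how often each distinct one occurs. The bracket notation ("[N" to open, "N]" to close, words as word_TAG), the function name find_constituents and the three example parses are assumptions for illustration, and the input is assumed to have already passed the well-formedness checks.

```python
from collections import Counter


def find_constituents(parses, wanted):
    """Count every occurrence of constituents labelled `wanted`."""
    found = Counter()
    for text in parses:
        stack = []  # each entry: [label, tokens seen inside the constituent]
        for token in text.split():
            if token.startswith("["):
                stack.append([token[1:], []])
            elif token.endswith("]") and stack:
                label, inner = stack.pop()
                # Rebuild the bracketed text of the constituent just closed.
                span = "[" + label + " " + " ".join(inner) + " " + label + "]"
                if label == wanted:
                    found[span] += 1
                if stack:
                    stack[-1][1].append(span)
            elif stack:
                stack[-1][1].append(token)
    return found


parses = [
    "[S [N old_JJ men_NN2 N] [V sing_VV0 V] S]",
    "[S [N old_JJ men_NN2 N] [V stop_VV0 V] S]",
    "[S [N men_NN2 N] [V sing_VV0 V] S]",
]
for span, frequency in find_constituents(parses, "N").most_common():
    print(frequency, span)
```

      The output lists each distinct N constituent with its frequency, most frequent first, which is the kind of listing a grammarian could scan when revising the rules for one constituent type.</Paragraph>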
    </Section>
  </Section>
  <Section position="9" start_page="213" end_page="214" type="metho">
    <SectionTitle>
9. Skeleton Parsing
</SectionTitle>
    <Paragraph position="0"> We are aiming to produce a million word corpus of parsed sentences by December 1988 so that we can implement a variant of the CYK algorithm (Hopcroft and Ullman, 1979: 140) to obtain a set of parses for each sentence. Viterbi labelling (Bahl et al., 1983; Forney, 1973) could be used to select the most probable parse from the output parse set. But there are problems associated with assembling a fully parsed databank: (1) [...] of productions and (2) [...] the parsed databank to an evolving grammar.</Paragraph>
    <Paragraph position="1"> In order to circumvent these problems, a strategy of skeleton parsing has been introduced. In skeleton parsing, grammarians create minimal labelled bracketing by inserting only those labelled brackets that are uncontroversial and, in some cases, by inserting brackets with no labels. The grammar validation routine is de-coupled from the input program so changes to the grammar can be made without disrupting the input parsing. The strategy also prevents extensive retrospective editing whenever the grammar is modified.</Paragraph>
    <Paragraph position="2"> Grammar development and parsed databank construction are not entirely independent, however. A subset (10 per cent) of the skeleton parses are [...] for comparison with the current grammar, while another subset (1 per cent) is checked by [...] grammarians.</Paragraph>
    <Paragraph position="3"> Skeleton parsing will give us a partially parsed databank which should limit the alternative parses compatible with the final grammar. We can either assume each parse is equally likely and use the frequency weighted productions generated by the partially parsed databank to upgrade or downgrade alternative parses, or we can use a 'restrained' outside/inside algorithm (Baker, 1979) to find the optimal parse.</Paragraph>
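    <Paragraph> The first option, using frequency weighted productions to upgrade or downgrade alternative parses, can be sketched by scoring each candidate parse as the product of the relative frequencies of its productions. This is only an illustration of that idea and not the outside/inside algorithm; the function names, the floor value for unseen productions, the representation of a parse as a list of productions and all counts are assumptions.

```python
import math
from collections import Counter


def production_probabilities(counts):
    """Turn production counts into P(right-hand side | left-hand side)."""
    totals = Counter()
    for (lhs, _rhs), n in counts.items():
        totals[lhs] += n
    return {prod: n / totals[prod[0]] for prod, n in counts.items()}


def rank_parses(parses, probs, floor=1e-6):
    """Sort candidate parses from most to least likely."""
    def log_score(parse):
        # Sum of log probabilities, equivalent to a product of probabilities.
        return sum(math.log(probs.get(prod, floor)) for prod in parse)
    return sorted(parses, key=log_score, reverse=True)


# Invented production counts, standing in for those from a parsed databank.
counts = {
    ("S", ("N", "V")): 100,
    ("N", ("JJ", "NN2")): 30,
    ("N", ("NN2",)): 60,
    ("N", ("JJ", "N")): 10,
    ("V", ("VV0",)): 50,
}
probs = production_probabilities(counts)

# Two alternative analyses of the same tag sequence JJ NN2 VV0.
parse_a = [("S", ("N", "V")), ("N", ("JJ", "NN2")), ("V", ("VV0",))]
parse_b = [("S", ("N", "V")), ("N", ("JJ", "N")), ("N", ("NN2",)), ("V", ("VV0",))]

best = rank_parses([parse_b, parse_a], probs)[0]
print(best == parse_a)  # True
```

    Here the flatter analysis wins because its N production has relative frequency 0.3, whereas the nested analysis multiplies 0.1 by 0.6.</Paragraph>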
    <Paragraph position="4"> word tags: ICS = preposition-conjunction; IW = with, without as prepositions; PPHS1 = he, she; PPHS2 = they; PPIO2 = us; PPIS2 = we; RT = nominal adverb of time; VM = modal auxiliary verb; hypertags: S = included sentence; S&amp; = first coordinated main clause; S+ = non-initial coordinated main clause following a conjunction; Si = interpolated or appended sentence.</Paragraph>
  </Section>
</Paper>