<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1092">
<Title>ANNOTATING 200 MILLION WORDS: THE BANK OF ENGLISH PROJECT</Title>
<Section position="3" start_page="0" end_page="565" type="intro">
<SectionTitle> 2 PREPROCESSOR </SectionTitle>
<Paragraph position="0"> The preprocessing modules standardise the running text and tokenise it into a form suitable for the ENGTWOL lexical analyser.</Paragraph>
<Paragraph position="1"> ENGCG has been developed so that it takes into account various textual coding conventions [Karlsson, 1994]. We have developed preprocessing procedures further to cater for the different types of markup codes systematically. Since texts usually come from various sources, there may be undocumented idiosyncrasies or systematic errors in some samples.</Paragraph>
<Paragraph position="2"> The information conveyed by the markup codes is utilised in the parsing process. Updating the preprocessing module to achieve the highest possible systematisation is therefore considered worthwhile. The present system can deal with any code properly if it is used unambiguously in either a sentence-delimiting function (e.g. codes indicating headings, paragraph markers), sentence-internal function (e.g. font change codes) or word-internal (e.g. accent codes) function.</Paragraph>
<Paragraph position="3"> Since preprocessing is the first step before lexical filtering, it indicates the kinds of difficulties we are likely to encounter. If error messages are produced at this stage, I do the necessary adjustments to the preprocessor until it seems to produce the output smoothly. Errors in preprocessing may occasionally result in a truncation of lengthy passages of text, or even a crash.</Paragraph>
<Paragraph position="4"> It is important for the utilisation of the corpus that no information is lost during standardisation.</Paragraph>
<Paragraph position="5"> Therefore, we aim to mark all corrections made to the text. For example, the preprocessor inserts a code marking the correction when it separates strings such as ofthe and andthe.</Paragraph>
<Paragraph position="6"> Most errors are not corrected, such as confusion of sentence boundaries, truncation of sentences due to running headings or page numbers, misplacement or doubling of blocks of text, etc.</Paragraph>
</Section>
</Paper>