<?xml version="1.0" standalone="yes"?> <Paper uid="P85-1036"> <Title>Comouters.---~r~-'~de-~, R~ode Island:</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> GRAMMATICAL ANALYSIS BY COMPUTER OF THE LANCASTER-OSLO/BERGEN (LOB) CORPUS OF BRITISH ENGLISH TEXTS. </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs which provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form.</Paragraph> <Paragraph position="1"> The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus in which each word token carries a tag showing its word class. Over 93 per cent of the word tags were correctly selected by using a matrix of tag pair probabilities, and this figure was raised by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor. The system was originally designed to run in batch mode over the corpus, but we have recently modified the procedures to run interactively on sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention. A similar probabilistic system is being developed for phrase and clause tagging.</Paragraph> </Section> </Paper>
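<!--
Editorial illustration, not part of the original paper. The abstract says word tags
were selected "by using a matrix of tag pair probabilities". The sketch below shows
one way such a matrix could drive disambiguation: each word has a set of candidate
tags, and the tag sequence with the highest product of tag pair probabilities is
kept. The tag names, the lexicon, the probability values, the DEFAULT_PROB floor and
the function name best_tag_sequence are all hypothetical; they are not taken from the
LOB tagging suite, and the dynamic-programming search is an assumption about how the
best path might be found.

```python
from typing import Dict, List, Tuple

# Hypothetical tag pair probabilities P(tag | previous tag); values are invented.
TAG_PAIR_PROB: Dict[Tuple[str, str], float] = {
    ("START", "DET"): 0.6,
    ("START", "NOUN"): 0.3,
    ("START", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9,
    ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.6,
    ("NOUN", "NOUN"): 0.4,
    ("VERB", "NOUN"): 0.5,
    ("VERB", "VERB"): 0.1,
}

# Hypothetical lexicon listing the candidate tags of each word.
LEXICON: Dict[str, List[str]] = {
    "the": ["DET"],
    "run": ["NOUN", "VERB"],
    "ends": ["NOUN", "VERB"],
}

DEFAULT_PROB = 0.01  # assumed floor for tag pairs missing from the matrix


def best_tag_sequence(words: List[str]) -> List[str]:
    """Return the candidate tag sequence with the highest tag pair product."""
    # best maps the tag of the latest word to (score, tag path ending in that tag).
    best: Dict[str, Tuple[float, List[str]]] = {"START": (1.0, [])}
    for word in words:
        candidates = LEXICON.get(word.lower(), ["NOUN"])  # assumed fallback tag
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for tag in candidates:
            # Keep only the best-scoring way of reaching this tag.
            score, path = max(
                (
                    (prev_score * TAG_PAIR_PROB.get((prev, tag), DEFAULT_PROB), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

With these invented numbers, best_tag_sequence(["the", "run", "ends"]) returns
["DET", "NOUN", "VERB"]. The retagging of problematic strings and the adjusted
weightings for sequences of three tags mentioned in the abstract are not modelled.
-->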