<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1104"> <Title>SYNTACTIC ANALYSIS OF NATURAL LANGUAGE USING LINGUISTIC RULES AND CORPUS-BASED PATTERNS</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 INTRODUCTION </SectionTitle> <Paragraph position="0"> We discuss surface-syntactic analysis of running text.</Paragraph> <Paragraph position="1"> Our purpose is to mark each word with a syntactic tag. The tags denote subjects, objects, main verbs, adverbials, etc. They are listed in Appendix A.</Paragraph> <Paragraph position="2"> Our method roughly proceeds as follows: * Assign to each word all the possible syntactic tags. * Disambiguate words as much as possible using linguistic information (hand-coded rules). Here we avoid risks; we would rather leave words ambiguous than guess wrong.</Paragraph> <Paragraph position="3"> * Use global patterns to form alternative sentence-level readings. Those alternative analyses are selected that match the strictest global pattern. If it does not accept any of the remaining readings, the second strictest pattern is used, and so on.</Paragraph> <Paragraph position="4"> * Use local patterns to rank the remaining readings.</Paragraph> <Paragraph position="5"> The local patterns contain possible contexts for syntactic functions. The ranking of the readings depends on the length of the contexts associated with the syntactic functions of the sentence.</Paragraph> <Paragraph position="6"> We use both linguistic knowledge, represented as rules, and empirical data collected from tagged corpora. We describe a new way to collect information from a tagged corpus and a way to apply it. 
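The strictest-pattern-first selection step described above (try the strictest global pattern, then fall back to looser ones until something matches) might be sketched as follows. The function name, the toy patterns, and the tag sequences are illustrative assumptions, not code or data from the paper.

```python
# Illustrative sketch of strictest-pattern-first selection. The paper's
# actual global patterns are collected from a tagged corpus; here they
# are simple hand-written predicates over tag sequences.

def select_readings(readings, global_patterns):
    """Keep the readings accepted by the strictest matching pattern.

    `global_patterns` is ordered strictest-first; each pattern is a
    predicate over one sentence-level reading (a list of syntactic tags).
    """
    for pattern in global_patterns:
        matching = [r for r in readings if pattern(r)]
        if matching:  # the strictest pattern that accepts anything wins
            return matching
    return readings   # no pattern matched: leave all readings in place

# Toy patterns, strictest first (invented for illustration):
strict = lambda r: r == ["SUBJ", "MAINV", "OBJ"]  # exact SVO shape
loose = lambda r: "MAINV" in r                    # any reading with a main verb
```

If the strict pattern accepts at least one reading, only those survive; otherwise the loose pattern is consulted, mirroring the fallback order in the text.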
In this paper, we are mainly concerned with exploiting the empirical data and combining two different kinds of parsers.</Paragraph> <Paragraph position="7"> *This work was done when the author worked in the Research Unit for Computational Linguistics at the University of Helsinki.</Paragraph> <Paragraph position="8"> Our work is based on work done with ENGCG, the Constraint Grammar Parser of English \[Karlsson, 1990; Karlsson, 1994; Karlsson et al., 1994; Voutilainen, 1994\]. It is a rule-based tagger and surface-syntactic parser that makes a very small number of errors but leaves some words ambiguous, i.e. it prefers ambiguity to guessing wrong. The morphological part-of-speech analyser \[Voutilainen et al., 1992\] leaves only 0.3 % of all words in running text without the correct analysis, while 3-6 % of words still have two or more1 analyses.</Paragraph> <Paragraph position="9"> Voutilainen, Heikkilä and Anttila \[1992\] reported that the syntactic analyser leaves 3-3.5 % of words without the correct syntactic tag, and 15-20 % of words remain ambiguous. Currently, the error rate has been decreased to 2-2.5 % and the ambiguity rate to 15 % by Timo Järvinen \[1994\], who is responsible for tagging a 200 million word corpus using ENGCG in the Bank of English project.</Paragraph> <Paragraph position="10"> Although the ENGCG parser works very well in part-of-speech tagging, the syntactic descriptions are still problematic. In the constraint grammar framework, it is quite hard to make linguistic generalisations that can be applied reliably. To resolve the remaining ambiguity we generate, by using a tagged corpus, a knowledge base that contains information about both the general structure of the sentences and the local contexts of the syntactic tags. The general structure contains information about where, for example, subjects, objects and main verbs appear and how they follow one another. It does not pay any attention to their potential modifiers. 
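One way the local-context ranking could work is sketched below, under a simplifying assumption that a local context is stored as a short tag sequence beginning at the tag itself; a reading's score sums, per tag, the length of the longest known context matching there. All names and the toy knowledge base are hypothetical, not from the paper.

```python
# Hypothetical sketch of ranking readings by matched local-context length.
# A "context" here is simplified to a tuple of tags starting at the tag
# in question; longer matched contexts make a reading rank higher.

def context_score(reading, contexts):
    """Score one reading against a dict mapping tag -> known contexts."""
    score = 0
    for i, tag in enumerate(reading):
        best = 0
        for ctx in contexts.get(tag, []):
            if tuple(reading[i:i + len(ctx)]) == ctx:
                best = max(best, len(ctx))  # longest matching context wins
        score += best
    return score

def rank_readings(readings, contexts):
    """Order readings so that those with longer matched contexts come first."""
    return sorted(readings, key=lambda r: context_score(r, contexts),
                  reverse=True)

# Toy knowledge base, standing in for data collected from a tagged corpus:
contexts = {"SUBJ": [("SUBJ", "MAINV")], "MAINV": [("MAINV", "OBJ")]}
```

With this toy data, a reading whose subject is followed by a main verb and whose main verb is followed by an object outranks one where neither context matches.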
The modifier-head relations are resolved by using the local context, i.e. by looking at what kinds of words there are in the neighbourhood. The method is robust in the sense that it is able to handle very large corpora. Although rule-based parsers usually perform slowly, that is not the case with ENGCG. With the English grammar, the Constraint Grammar Parser implementation by Pasi Tapanainen analyses 400 words2 per second on a SparcStation 10/30. That is, one million words are processed in about 40 minutes. The pattern parser for empirical patterns runs somewhat slower, about 100 words per second.</Paragraph> <Paragraph position="11"> 1 But even then some of the original alternative analyses are removed. 2 Including all steps of preprocessing, morphological analysis, disambiguation and syntactic analysis. The speed of morphological disambiguation alone exceeds 1000 words per second.</Paragraph> </Section> </Paper>