<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3030"> <Title>Constraint Grammar as a Framework for Parsing Running Text</Title> <Section position="2" start_page="0" end_page="168" type="intro"> <SectionTitle> 1. Outline </SectionTitle> <Paragraph position="0"> Grammars which are used in parsers are often directly imported from autonomous grammar theory and descriptive practice that were not devised for the explicit purpose of parsing. Parsers have been designed for English based on e.g. Government and Binding Theory, Generalized Phrase Structure Grammar, and Lexical-Functional Grammar. We present a formalism to be used for parsing where the grammar statements are closer to real text sentences and more directly address some notorious parsing problems, especially ambiguity. The formalism is a linguistic one. It relies on transitional probabilities in an indirect way. The probabilities are not part of the description.</Paragraph> <Paragraph position="1"> The descriptive statements, constraints, do not have the ordinary task of defining the notion 'correct sentence in L'. They are less categorical in nature, more closely tied to morphological features, and more directly geared towards the basic task of parsing. We see this task as one of inferring surface structure from a stream of concrete tokens in a basically bottom-up mode. Constraints are formulated on the basis of extensive corpus studies. They may reflect absolute, rule-like facts, or probabilistic tendencies where a certain risk is judged proper to take. Constraints of the former, rule-like type are of course preferable.</Paragraph> <Paragraph position="2"> The ensemble of constraints for language L constitutes a Constraint Grammar (CG) for L. A CG is intended to be used by the Constraint Grammar Parser CGP, implemented as a Lisp interpreter.</Paragraph> <Paragraph position="3"> Our input tokens to CGP are morphologically analyzed word-forms. 
One central idea is to maximize the use of morphological information for parsing purposes. All relevant structure is assigned directly via lexicon, morphology, and simple mappings from morphology to syntax. The task of the constraints is basically to discard as many alternatives as possible, the optimum being a fully disambiguated sentence with one syntactic reading only.</Paragraph> <Paragraph position="4"> The second central idea is to treat morphological disambiguation and syntactic labelling by the same mechanism of discarding improper alternatives.</Paragraph> <Paragraph position="5"> A good parsing formalism should satisfy many requirements: the constraints should be declarative rather than procedural, they should be able to cope with any real-world text sentence (i.e. with running text, not just with linguists' laboratory sentences), they should be clearly separated from the program code by which they are executed, the formalism should be language-independent, it should be reasonably easy to implement (optimally as finite-state automata), and it should also be efficient to run. The CG formalism adheres to these desiderata.</Paragraph> <Paragraph position="6"> 2. Breaking up the problem of parsing The problem of parsing running text may be broken up into seven subproblems or 'modules': preprocessing, morphological analysis, local morphological disambiguation, morphosyntactic mapping, context-dependent morphological disambiguation, determination of intrasentential clause boundaries, and disambiguation of surface syntactic functions.</Paragraph> <Paragraph position="7"> The first four of these modules are executed sequentially, optimally followed by parallel execution of the last three modules, which constitute 'syntax proper'. This yields a five-stage parsing process.</Paragraph> <Paragraph position="8"> In this general setting, CG is the formalism of the fifth stage, syntax proper. The same CG constraint formalism is used to disambiguate morphological and syntactic ambiguities, and to locate clause boundaries in a complex sentence. 
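To make the discarding mechanism concrete, the following sketch shows how a contextual constraint might prune a cohort. The data structures, the rule, and all names here are our own assumptions for exposition, not the actual CG notation; the English readings for a and move are taken from the example below.

```python
# Illustrative sketch (NOT the actual CG rule syntax): a constraint
# inspects a cohort's context and discards improper readings.
from dataclasses import dataclass, field

@dataclass
class Reading:
    base: str
    features: list                                 # e.g. ["DET", "CENTR", "ART", "INDEF"]
    functions: list = field(default_factory=list)  # syntactic labels, e.g. ["@DN>"]

@dataclass
class Cohort:
    form: str
    readings: list

def discard_verbs_after_determiner(sentence):
    """Hypothetical constraint: if the preceding cohort is an unambiguous
    determiner, discard verbal readings of the current word-form."""
    for prev, cur in zip(sentence, sentence[1:]):
        prev_is_det = len(prev.readings) == 1 and "DET" in prev.readings[0].features
        if prev_is_det:
            survivors = [r for r in cur.readings if "V" not in r.features]
            if survivors:  # never discard the last remaining reading
                cur.readings = survivors
    return sentence

# "a" has one reading, "move" four (see the English cohort example).
a = Cohort("a", [Reading("a", ["DET", "CENTR", "ART", "INDEF"], ["@DN>"])])
move = Cohort("move", [
    Reading("move", ["N", "NOM", "SG"]),
    Reading("move", ["V", "SUBJUNCTIVE"], ["@+FMAINV"]),
    Reading("move", ["V", "IMP"], ["@+FMAINV"]),
    Reading("move", ["V", "INF"], ["@-FMAINV", "@<NOM-FMAINV"]),
])

discard_verbs_after_determiner([a, move])
print([r.features for r in move.readings])  # only the noun reading survives
```

After the constraint applies, the cohort of move is fully disambiguated, which is the optimum the text describes.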
Parts of the CG formalism are also used in morphosyntactic mapping. Real texts are full of idiosyncrasies in regard to headings, footnotes, paragraph structure, punctuation, use of upper and lower case, etc. Such phenomena must be properly normalized. Furthermore, several purely linguistic phenomena must somehow be dealt with prior to single-word morphological analysis, especially idioms and other more or less fixed multi-word expressions. (It would e.g. make no sense to subject the individual words of the expression in spite of to plain morphological analysis.) The existence of an adequate preprocessor is here simply taken for granted.</Paragraph> <Paragraph position="9"> We concentrate on morphological analysis, clause boundary determination, morphological disambiguation, and syntactic function assignment. Viewing the problem of parsing in turn from one or another of these angles clarifies many intricacies. The subproblems assume more manageable proportions and make possible a novel type of modularity.</Paragraph> <Paragraph position="10"> Morphological analysis is relatively independent.</Paragraph> <Paragraph position="11"> CGP is always supplied with adequate morphological input. The morphological analyzers are designed according to Koskenniemi's (1983) two-level model.</Paragraph> <Paragraph position="12"> Currently our Research Unit has morphological analyzers available for English (41,000 lexicon entries), Finnish (37,000 entries), and Swedish (42,000 entries). Below are two morphologically analyzed English word-forms; a has one reading, move four. The set of readings for a word-form we call a cohort. All readings in a cohort have the base-form initially on the line. Upper-case strings are morphological features, except for those containing the designated initial character &quot;@&quot;, which denotes that the string following it is the name of a syntactic function, here emanating from the lexicon. 
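A small sketch of how one reading line under these textual conventions might be decoded programmatically: the base form comes first in double quotes, upper-case strings are morphological features, and strings beginning with &quot;@&quot; name syntactic functions. The helper name and the exact string format are our own assumptions, not part of the CG formalism.

```python
# Decode one reading line into (base form, features, syntactic functions),
# assuming the conventions described in the text.
def parse_reading(line):
    """Split e.g. '"move" V INF @-FMAINV @<NOM-FMAINV' into its parts."""
    quote_end = line.index('"', 1)   # base form is quoted at the start
    base = line[1:quote_end]
    tags = line[quote_end + 1:].split()
    features = [t for t in tags if not t.startswith("@")]
    functions = [t for t in tags if t.startswith("@")]
    return base, features, functions

print(parse_reading('"move" V INF @-FMAINV @<NOM-FMAINV'))
# → ('move', ['V', 'INF'], ['@-FMAINV', '@<NOM-FMAINV'])
```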
&quot;@DN>&quot; = determiner as modifier of the next noun to the right, &quot;@+FMAINV&quot; = finite main verb, &quot;@-FMAINV&quot; = non-finite main verb as member of a verb chain, &quot;@<NOM-FMAINV&quot; = non-finite main verb as post-modifier of a nominal: a &quot;a&quot; DET CENTR ART INDEF @DN> move &quot;move&quot; N NOM SG &quot;move&quot; V SUBJUNCTIVE @+FMAINV &quot;move&quot; V IMP @+FMAINV &quot;move&quot; V INF @-FMAINV @<NOM-FMAINV Compounds are described by recursive links back to the main lexicon. Consider the cohort of the Swedish word-form frukosten (&quot;_&quot; = compound boundary, frukost 'breakfast', fru 'mrs', kost 'nutrition', ko 'cow', sten 'stone'): frukosten &quot;frukost&quot; N UTR DEF SG NOM &quot;fru_kost&quot; N UTR DEF SG NOM &quot;fru_ko_sten&quot; N UTR INDEF SG NOM By 'local disambiguation' we refer to constraints or strategies that make it possible to discard some readings just by local inspection of the current cohort, without invoking any contextual information. The present cohort contains three readings. An interesting local disambiguation strategy can now be stated: &quot;Discard all readings with more than the smallest number of compound boundaries occurring in the current cohort&quot;. This strategy properly discards the readings &quot;fru_kost&quot; and &quot;fru_ko_sten&quot;. I have found this principle to be very close to perfect.</Paragraph> <Paragraph position="13"> A similar principle holds for derivation: &quot;Discard readings with derivational elements if there is at least one non-derived reading available in the cohort&quot;. Other local disambiguation strategies compare multiple compound readings in terms of how probable their part-of-speech structure is (NNN, ANN, NVN, AAV, etc.).</Paragraph> <Paragraph position="14"> Local disambiguation is a potent module. The Swedish morphological analyzer was applied to a text containing some 840,000 word-form tokens. The following table shows cohort size N(r) in the first column. 
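The compound-boundary strategy quoted above can be sketched in a few lines; the list-of-tuples cohort representation is our own assumption for illustration.

```python
# Local disambiguation sketch: keep only the readings with the smallest
# number of compound boundaries ("_") occurring in the current cohort.
def discard_extra_compounds(cohort):
    """cohort: list of (base_form, features) readings for one word-form."""
    fewest = min(base.count("_") for base, _ in cohort)
    return [(base, feats) for base, feats in cohort if base.count("_") == fewest]

# The Swedish cohort for "frukosten" from the text:
frukosten = [
    ("frukost", "N UTR DEF SG NOM"),
    ("fru_kost", "N UTR DEF SG NOM"),
    ("fru_ko_sten", "N UTR INDEF SG NOM"),
]

print(discard_extra_compounds(frukosten))
# → [('frukost', 'N UTR DEF SG NOM')]
```

Because no contextual information is consulted, this check can run as soon as the cohort leaves the morphological analyzer.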
The second and third columns show the number of cohorts with the respective number of readings before (a) and after (b) local disambiguation. E.g., before local disambiguation there were 3830 word-forms with 6 readings, but after local disambiguation only 312.</Paragraph> <Paragraph position="15"> Here, disambiguation refers to the reduction of morphological ambiguities, optimally down to cohort size = 1. Sense disambiguation is not included (presently our lexical items have no sense descriptions).</Paragraph> <Paragraph position="16"> The subproblems of morphosyntactic mapping, morphological disambiguation, clause boundary location, and syntactic function determination are interrelated. E.g., for disambiguation it is useful to know the boundaries of the current clause, and to know as much as possible about its syntactic structure. An important aspect of the general problem is to work out the precise relations between these modules.</Paragraph> </Section></Paper>