<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0306"> <Title>NPtool~ a detector of English noun phrases *</Title> <Section position="4" start_page="0" end_page="49" type="intro"> <SectionTitle> FULLSTOP </SectionTitle> <Paragraph position="0"> In this type of analysis, each word is provided with tags indicating e.g. part of speech, inflection, derivation, and syntactic function.</Paragraph> <Paragraph position="1"> * Morphological and syntactic descriptions are based on hand-coded linguistic rules rather than on corpus-based statistical models. They employ structural categories that can be found in descriptive grammars, e.g. \[Quirk, Greenbaum, Leech and Svartvik, 1985\].</Paragraph> <Paragraph position="2"> Regarding the at times heated methodological debate on whether statistical or rule-based information is to he preferred in grammatical analysis of running text (cf. e.g. \[Sampson, 1987a; Taylor, Grover and Briscoe, 1989; Church, 1992\]), we do not object to probabilistic methods in principle; nevertheless, it seems to us that rule-based descriptions are preferable bemuse they can provide for more accurate and reliable analyses than current probabilistic systems, e.g. part-of-speech taggers \[Voutilainen, Heikkil~ and Anttila, 1992; Voutilainen, forthcoming a\]. I Proba-IConsider for instance the question posed in \[Church, 1992\] whether lexical probabilities contribute more to morphological or parLor-speech disambiguation than context does. The ENGCG morphological disambiguator, which is entirely based on context rules, uniquely bilistic or heuristic techniques may still be a useful add-on to linguistic information, if potentially remaining ambiguities must be resolved - though with a higher risk of error.</Paragraph> <Paragraph position="3"> * In the design of our grammar schemes, we have paid considerable attention to the question on the resolvability of grammatical distinctions. In the design of accurate parsers of running text, this question is very important: if the description abounds with distinctions that can be dependably resolved only with extrasyntactic knowledge ~, then either the ambiguities due to these distinctions remain to burden the structure-based parser (as well as the potential application based on the analysis), or a guess, i.e. a misprediction, has to be hazarded.</Paragraph> <Paragraph position="4"> This descriptive policy brings with it a certain degree of shallowness; in terms of information content, a tag-based syntactic analysis is somewhere between morphological (e.g. part-of-speech) analysis and a conventional syntactic analysis, e.g. a phrase structure tree or a feature-based analysis. What we hope to achieve with this compromise in information content is the higher reliability of the proposed analyses. A superior accuracy could be considered as an argument for postulating a new, 'intermediary' level of computational syntactic description. For more details, see e.g. \[Voutilainen and Tapanainen, 1993\]. * Our grammar schemes are also learnable: according to double-blind experiments on manually assigning morphological descriptions, a 100% interjudge agreement is typical \[Voutilainen, forthcoming a\]. 3 * The ability to parse running text is of a high priority. Not only a structurally motivated description is important; in the construction of the parsing grammars and lexica, attention should also be paid to corpus evidence. 
<Paragraph position="8"> * Our grammar schemes are also learnable: according to double-blind experiments on manually assigning morphological descriptions, a 100% interjudge agreement is typical [Voutilainen, forthcoming a] (footnote 3).</Paragraph>
<Paragraph position="9"> Footnote 3: [Church, 1992] probably indicates that in the case of debatable constructions, explicit descriptive conventions have not been consistently established. Only a carefully defined grammar scheme makes the evaluation of the accuracy of the parsing system a meaningful enterprise (see also [Sampson, 1987b]).</Paragraph>
<Paragraph position="10"> * The ability to parse running text is of high priority. Not only is a structurally motivated description important; in the construction of the parsing grammars and lexica, attention should also be paid to corpus evidence. Often a grammar rule, as we express it in our parsing grammars, is formed as a generalisation 'inspired' by corpus observations; in this sense the parsing grammar is corpus-based.</Paragraph>
<Paragraph position="11"> However, the description need not be restricted to the corpus observation: the linguist is likely to generalise over past experience, and this is not necessarily harmful - as long as the generalisations can also be validated against representative test corpora.</Paragraph>
<Paragraph position="12"> * At least in a practical application, a parsing grammar should assign the best available analysis to its input rather than leave many of the input utterances unrecognised, e.g. as ill-formed. This does not mean that the concept of well-formedness is irrelevant for the present approach. Our point is simply: although we may consider some text utterance as deviant in one respect or another, we may still be interested in extracting as much information as possible from it, rather than ignoring it altogether. To achieve this effect, the grammar rules should be used in such a manner (footnote 4) that no input becomes entirely rejected, although the rules as such may express categorical restrictions on what is possible or well-formed in the language.</Paragraph>
<Paragraph position="13"> Footnote 4: E.g. by ranking the grammar rules in terms of compromisability.</Paragraph>
<Paragraph position="14"> * In our approach, parsing consists of two main kinds of operation:
1. Context-insensitive lookup of (alternative) descriptions for input words;
2. Elimination of unacceptable or contextually illegitimate alternatives.</Paragraph>
<Paragraph position="15"> Morphological analysis typically corresponds to the lookup module: it produces the desired morphosyntactic analysis of the sentence, along with a number of inappropriate ones, by providing each word in the sentence with all conventional analyses as a list of alternatives. The grammar itself exerts the restrictions on permitted sequences of words and descriptors. In other words, syntactic analysis proceeds by way of ambiguity resolution or disambiguation: the parser eliminates ill-formed readings, and what 'survives' the grammar is the (syntactic) analysis of the input utterance. Since the input contains the desired analysis, no new structure will be built during syntactic analysis itself (a small illustrative sketch of this scheme follows below).</Paragraph>
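<Paragraph position="16"> As a concrete, if simplified, illustration of this reductionistic scheme, the following Python sketch first looks up all alternative readings for each word and then applies independent elimination rules. The lexicon, tag names and rules are invented for illustration; they do not reproduce the actual NPtool/ENGCG lexicon or grammar.
# Minimal sketch of reductionistic parsing: lexicon lookup followed by
# elimination of contextually illegitimate readings. All data below are
# invented for illustration only.

# 1. Context-insensitive lookup: every conventional reading of each word.
LEXICON = {
    "the":   [("the", "DET")],
    "round": [("round", "N"), ("round", "V"), ("round", "ADJ"), ("round", "PREP")],
    "table": [("table", "N"), ("table", "V")],
}

def lookup(sentence):
    return [LEXICON[word] for word in sentence]

# 2. Elimination: each rule discards readings that are illegitimate in their
# context; whatever survives constitutes the analysis. The fallback
# 'or readings' ensures that no word is ever left without a reading,
# so no input is rejected outright.
def no_verb_after_determiner(readings, i, sentence_readings):
    # Discard verb readings immediately after an unambiguous determiner.
    if i > 0 and all(tag == "DET" for _, tag in sentence_readings[i - 1]):
        return [r for r in readings if r[1] != "V"] or readings
    return readings

def no_preposition_sentence_finally(readings, i, sentence_readings):
    # Discard preposition readings in sentence-final position.
    if i == len(sentence_readings) - 1:
        return [r for r in readings if r[1] != "PREP"] or readings
    return readings

CONSTRAINTS = [no_verb_after_determiner, no_preposition_sentence_finally]

def disambiguate(sentence):
    analyses = lookup(sentence)
    # The rules are intended to be independent: in this toy example, applying
    # them in any order yields the same surviving readings. Readings that no
    # rule eliminates remain as residual ambiguity.
    for constraint in CONSTRAINTS:
        analyses = [constraint(readings, i, analyses)
                    for i, readings in enumerate(analyses)]
    return analyses

for word, readings in zip(["the", "round", "table"],
                          disambiguate(["the", "round", "table"])):
    print(word, readings)
</Paragraph>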
<Paragraph position="17"> * Our grammars consist of constraints - partial distributional definitions of morphosyntactic categories, such as parts of speech or syntactic functions. Each constraint expresses a piecemeal linear-precedence generalisation about the language, and the constraints are independent of each other. That is, the constraints can be applied in any order: a true grammar will produce the same analysis, whatever the order.</Paragraph>
<Paragraph position="18"> The grammarian is relatively free to select the level of abstraction at which (s)he is willing to express the distributional generalisation. In particular, reference to very low-level categories is also possible, and this contributes to the accuracy of the parser: while the grammar will contain more or less abstract, feature-oriented rules, it is often also expedient to state further, more particular restrictions on more particular distributional classes, even at the word-form level. These 'smaller' rules do not contradict the more general rules; often it is simply the case that further restrictions can be imposed on smaller lexical classes (footnote 5). This flexibility in the grammar formalism greatly contributes to the accuracy of the parser [Voutilainen, forthcoming a; Voutilainen, forthcoming 1993].</Paragraph>
<Paragraph position="19"> 2 Uses of a noun phrase parser
The recognition and analysis of subclausal structural units, e.g. noun phrases, is useful for several purposes. Firstly, a noun phrase detector can be useful for research purposes: automatic large-scale analysis of running text provides the linguist with better means to conduct e.g. quantitative studies over large amounts of text.</Paragraph>
<Paragraph position="20"> An accurate though somewhat superficial analysis can also serve as a 'preprocessor' prior to a more ambitious, e.g. feature-based, syntactic analysis. This kind of division of labour is likely to be useful for technical reasons. One major problem with e.g. unification-based parsers is parsing time. If a substantial part of the overall problem is resolved with simpler and more efficient techniques, the task of the unification-based parser becomes more manageable. In other words, the more expressive but computationally heavier machinery of e.g. the unification-based parser can be reserved entirely for the analysis of the descriptively hardest problems, while the less complex parts of the overall problem are tackled with simpler and more efficient techniques.</Paragraph>
<Paragraph position="21"> Regarding production uses, even lower levels of analysis can be directly useful. For instance, the detection of noun phrases can provide e.g. information management and retrieval systems with a suitable input for index term generation.</Paragraph>
<Paragraph position="22"> Noun phrases can also serve as translation units; for instance, [van der Eijk, 1993] suggests that noun phrases are more appropriate translation units than words or part-of-speech classes.</Paragraph> </Section> </Paper>