<?xml version="1.0" standalone="yes"?> <Paper uid="P88-1013"> <Title>PROJECT APRIL -- A PROGRESS REPORT</Title> <Section position="2" start_page="0" end_page="104" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> Project APRIL (Annealing Parser for Realistic Input Language) is constructing a software system that uses the stochastic optimization technique known as &quot;simulated annealing&quot; (Kirkpatrick et al. 1983, van Laarhoven &amp; Aarts 1987) to parse authentic English inputs by seeking labelled tree-structures that maximize a measure of plausibility defined in terms of empirical statistics on parse-tree configurations drawn from a database of manually parsed English text. This approach is a response to the fact that &quot;real-life&quot; English, such as the material in the Lancaster-Oslo/Bergen Corpus on which our research focuses, does not appear to conform to a fixed set of grammatical rules.</Paragraph> <Paragraph position="1"> (On the LOB Corpus and the research background from which Project APRIL emerged, see Garside et al. (1987). A crude pilot version of the APRIL system was described in Sampson (1986).) Orthodox computational linguistics is heavily influenced by a concept of language according to which the set of all strings over the vocabulary of the language is partitioned into a class of grammatical strings, which possess analyses all parts of which conform to a finite set of rules defining the language, and a class of strings which are ungrammatical and for which the question of their grammatical structure accordingly does not arise. Even systems which set out to handle &quot;deviant&quot; sentences commonly do so by referring them to particular &quot;non-deviant&quot; sentences of which they are deemed to be distortions. In our work with authentic texts, however, we find the &quot;grammaticality&quot; concept unhelpful. 
It frequently happens that a word-sequence occurs which violates some recognized rule of English grammar, yet any reader can understand the passage without difficulty, and it often seems unlikely that most readers would notice the violation. Furthermore, a problem which is probably even more troublesome for the rule-based approach is that there is an apparently endless diversity of constructions that no-one would be likely to describe as ungrammatical or deviant. Impressionistically it appears that any attempt to state a finite set of rules covering everything that occurs in authentic English text is doomed to go on adding more rules as long as more text is examined; Sampson (1987) adduced objective evidence supporting this impression.</Paragraph> <Paragraph position="2"> Our approach, therefore, is to define a function which associates a figure of merit with any possible tree having labels drawn from a recognized alphabet of grammatical category-symbols; any input sentence is parsed by seeking the highest-valued tree possible for that sentence. The analysis process works the same way, whether the input is impeccably grammatical or quite bizarre. No contrast between legal and illegal labelled trees arises: a tree which would ordinarily be described as thoroughly illegal is in our terms just a tree whose figure of merit is relatively very poor.</Paragraph> <Paragraph position="3"> This conception of parsing as optimization of a function defined for all inputs seems to us not implausible as a model of how people understand language. But that is not our concern; what matters to us is that this model seems very fruitful for automatic language-processing systems. It has a theoretical disadvantage by comparison with rule-based approaches: if an input is perfectly grammatical but contains many out-of-the-way (i.e. 
low frequency) constructions, the correct analysis may be assigned a low figure of merit relative to some alternative analysis which treats the sentence as an imperfect approximation to a structure composed of high-frequency constructions. However, our experience is that, in authentic English, &quot;trick sentences&quot; of this kind tend to be much rarer than textbooks of theoretical linguistics might lead one to imagine. Against this drawback our approach balances the advantage of robustness. No input, no matter how bizarre, can cause our system simply to fail to return any analysis. Our sponsors, the Royal Signals and Radar Establishment (an agency of the U.K. Ministry of Defence), are principally interested in speech analysis, and arguably this robustness should be even more advantageous for spoken language, which makes little use of constructions that are legitimate but recherché, while it contains a great deal that is sloppy or</Paragraph> </Section> </Paper>