<?xml version="1.0" standalone="yes"?>
<Paper uid="P92-1021">
<Title>LATTICE-BASED WORD IDENTIFICATION IN CLARE</Title>
<Section position="3" start_page="0" end_page="159" type="intro">
<SectionTitle>1 INTRODUCTION</SectionTitle>
<Paragraph position="0">In many language processing systems, uncertainty in the boundaries of linguistic units, at various levels, means that data are represented not as a well-defined sequence of units but as a lattice of possibilities. It is common for speech recognizers to maintain a lattice of overlapping word hypotheses from which one or more plausible complete paths are subsequently selected. Syntactic parsing, of either spoken or written language, frequently makes use of a chart or well-formed substring table because the correct bracketing of a sentence cannot (easily) be calculated deterministically. Lattices are also often used in converting Japanese text typed in kana (syllabic symbols) to kanji; the lack of interword spacing in written Japanese and the complex morphology of the language mean that lexical items and their boundaries cannot be reliably identified without applying syntactic and semantic knowledge (Abe et al., 1986).</Paragraph>
<Paragraph position="1">In contrast, it is often assumed that, for languages written with interword spaces, it is sufficient to group an input character stream deterministically into a sequence of words, punctuation symbols, and perhaps other items, and to hand this sequence to the parser, possibly after word-by-word morphological analysis. Such an approach is sometimes adopted even when typographically complex inputs are handled; see, for example, Futrelle et al., 1991.</Paragraph>
<Paragraph position="2">In this paper I observe that, for typed input, spaces do not necessarily correspond to boundaries between lexical items, both for linguistic reasons and because of the possibility of typographic errors. This means that a lattice representation, not a simple sequence, should be used throughout front-end (preparsing) analysis. The CLARE system under development at SRI Cambridge uses such a representation, allowing it to deal straightforwardly with combinations or multiple occurrences of phenomena that would be difficult or impossible to process correctly under a sequence representation. As evidence for the performance of this approach, I describe an evaluation of CLARE's ability to deal with typing and spelling errors. Such errors are especially common in interactive use, for which CLARE is designed, and correcting as many of them as possible can make an appreciable difference to the usability of a system. The word identity and word boundary ambiguities encountered in interpreting errorful input often require the application of syntactic and semantic knowledge on a phrasal or even sentential scale. Such knowledge may be applied as soon as the problem is encountered, but this brings major problems with it, such as the need for adequate lookahead and the difficulty of engineering large systems whose processing levels are tightly coupled. To avoid these difficulties, CLARE adopts a staged architecture, in which indeterminacy is preserved until the knowledge needed to resolve it is ready to be applied. An appropriate representation is, of course, the key to doing this efficiently.</Paragraph>
</Section>
</Paper>
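To make the lattice idea concrete, here is a minimal, self-contained Python sketch of a word lattice over character positions. It is illustrative only: the Edge class, the toy lexicon, and the space-deletion rule are assumptions invented for this example, not CLARE's actual data structures or correction machinery. It shows how an input containing a spurious space ("the re") carries both the two-word reading and the single-word hypothesis "there" as overlapping edges, with the choice among complete paths deferred to later processing stages.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int  # index of the first character covered
    end: int    # index one past the last character covered
    word: str   # hypothesized lexical item

# Toy lexicon (an assumption for this sketch).
LEXICON = {"there", "the", "re", "is", "a", "cat"}

def build_lattice(text: str) -> list[Edge]:
    """Add an edge for every lexicon word matching at each position.

    A candidate span is also tried with interword spaces deleted, so
    "the re" yields both the two-word reading and a single edge for
    "there", modelling a typing error that inserted a space.
    """
    edges = []
    for i in range(len(text)):
        if text[i] == " ":
            continue
        for j in range(i + 1, len(text) + 1):
            span = text[i:j]
            if not span.endswith(" ") and span.replace(" ", "") in LEXICON:
                edges.append(Edge(i, j, span.replace(" ", "")))
    return edges

def complete_paths(edges: list[Edge], text: str) -> list[list[Edge]]:
    """Enumerate edge sequences that cover the whole input.

    A staged architecture would score and prune these paths only when
    syntactic/semantic knowledge becomes applicable; here we simply
    enumerate them to show that no hypothesis was discarded early.
    """
    def skip_spaces(pos: int) -> int:
        while pos < len(text) and text[pos] == " ":
            pos += 1
        return pos

    def extend(pos: int) -> list[list[Edge]]:
        pos = skip_spaces(pos)
        if pos == len(text):
            return [[]]  # one way to finish: add nothing more
        return [[e] + rest
                for e in edges if e.start == pos
                for rest in extend(e.end)]

    return extend(0)

if __name__ == "__main__":
    text = "the re is a cat"
    lattice = build_lattice(text)
    for path in complete_paths(lattice, text):
        print(" / ".join(e.word for e in path))
    # Prints both surviving readings:
    #   the / re / is / a / cat
    #   there / is / a / cat
```

The point of the sketch is the deferral the paper argues for: both readings survive front-end analysis as paths through the lattice, and whichever later stage has the requisite phrasal or sentential knowledge makes the choice.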