<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1021"> <Title>LATTICE-BASED WORD IDENTIFICATION IN CLARE</Title> <Section position="4" start_page="159" end_page="160" type="metho"> <SectionTitle> 2 SPACES AND WORD BOUNDARIES </SectionTitle> <Paragraph position="0"> In general, typing errors are not just a matter of one intended input token being miskeyed as another one. Spaces between tokens may be deleted (so that two or more intended words appear as one) or inserted (so that one word appears as two or more). Multiple errors, involving both spaces and other characters, may be combined in the same intended or actual token. A reliable spelling corrector must allow for all these possibilities, which must, in addition, be distinguished from the use of correctly-typed words that happen to fall outside the system's lexicon.</Paragraph> <Paragraph position="1"> However, even in the absence of &quot;noise&quot; of this kind, spaces do not always correspond to lexical item boundaries, at least if lexical items are defined in a way that is most convenient for grammatical purposes. For example, &quot;special&quot; forms such as telephone numbers or e-mail addresses, which are common in many domains, may contain spaces. In CLARE, these are analysed using regular expressions (cf Grosz et al, 1987), which may include space characters. When such an expression is realised, an analysis of it, connecting non-adjacent vertices if it contains spaces, is added to the lattice.</Paragraph> <Paragraph position="2"> The complexities of punctuation are another source of uncertainty: many punctuation symbols have several uses, not all of which necessarily lead to the same way of segmenting the input. For example, periods may indicate either the end of a sentence or an abbreviation, and slashes may be simple word-internal characters (e.g. X11/Ne WS) or function lexically as disjunctions, as in \[1\] I'm looking for suggestions for vendors to deal with/avoid. 1 Here, the character string &quot;with/avoid&quot;, although it contains no spaces, represents three lexical items that do not even form a syntactic constituent.</Paragraph> <Paragraph position="3"> CLARE's architecture and formalism allow for all these possibilities, and, as an extension, also permit multiple-token phrases, such as idioms, to be defined as equivalent to other tokens or token sequences. This facility is especially useful when CLARE is being tailored for use in a particular domain, since it allows people not expert in linguistics or the CLARE grammar to extend grammatical coverage in simple and approximate, but often practically important, ways. For example, if an application developer finds that inputs such as &quot;What number of employees have cars?&quot; are common, but that the construction &quot;what number of ...&quot; is not handled by the grammar, he can define the sequence &quot;what number of&quot; as equivalent to &quot;how many&quot;. This will provide an extension of coverage without the developer needing to know how any of the phrases involved are treated in the grammar. 
<Section position="5" start_page="160" end_page="161" type="metho">
<SectionTitle>3 CLARE'S PROCESSING STAGES</SectionTitle>

<Paragraph position="0">The CLARE system is intended to provide language processing capabilities (both analysis and generation) and some reasoning facilities for a range of possible applications.</Paragraph>

<Paragraph position="1">English sentences are mapped, via a number of stages, into logical representations of their literal meanings, from which reasoning can proceed. Stages are linked by well-defined representations. The key intermediate representation is that of quasi logical form (QLF; Alshawi, 1990, 1992), a version of slightly extended first order logic augmented with constructs for phenomena such as anaphora and quantification that can only be resolved by reference to context. The unification of declarative linguistic data is the basic processing operation.</Paragraph>

<Paragraph position="2">The specific task considered in this paper is the process of mapping single sentences from character strings to QLF. Two kinds of issue are therefore not discussed here. These are the problem of segmenting a text into sentences and dealing with any markup instructions (cf. Futrelle et al., 1991), which is logically prior to producing character strings; and possible context-dependence of the lexical phenomena discussed, which would need to be dealt with after the creation of QLFs.</Paragraph>

<Paragraph position="3">In the analysis direction, CLARE's front end processing stages are as follows (a toy rendering of the first two stages is sketched below).</Paragraph>

<Paragraph position="4">1. A sentence is divided into a sequence of clusters separated by white space.</Paragraph>

<Paragraph position="5">2. Each cluster is divided into one or more tokens: words (possibly inflected), punctuation characters, and other items. Tokenization is nondeterministic, and so a lattice is used at this and subsequent stages.</Paragraph>

<Paragraph position="6">3. Each token is analysed as a sequence of one or more segments. For normal lexical items, these segments are morphemes.</Paragraph>

<Paragraph position="8">The lexicon proper is first accessed at this stage.</Paragraph>

<Paragraph position="9">A variety of strategies for error recovery (including but not limited to spelling/typing correction) are attempted on tokens for which no segmentation could be found. Edges without segmentations are then deleted; if no complete path remains, sentence processing is abandoned. Further edges, possibly spanning non-adjacent vertices, are added to the lattice by the phrasal equivalence mechanism discussed earlier.</Paragraph>

<Paragraph position="10">Morphological, syntactic and semantic stages then apply to produce one or more QLFs.</Paragraph>

<Paragraph position="11">These are checked for adherence to sortal (selectional) restrictions, and, possibly with the help of user intervention, one is selected for further processing.</Paragraph>
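As a minimal illustration of stages 1 and 2 above, the following Python fragment splits a sentence into clusters and returns alternative tokenizations of each cluster. It is a sketch under strong simplifying assumptions: the single punctuation rule and all names are invented for the example and do not reflect CLARE's rule formats.

    def clusters(sentence):
        """Stage 1: split a sentence into clusters at white space."""
        return sentence.split()

    def tokenizations(cluster):
        """Stage 2: return every way of dividing a cluster into tokens.
        One toy rule only: a trailing punctuation character may either
        be split off as its own token or kept as part of the word."""
        results = [[cluster]]
        if len(cluster) > 1 and cluster[-1] in ",.;:?!":
            results.append([cluster[:-1], cluster[-1]])
        return results

    for cluster in clusters("Kim has boooks,"):
        print(cluster, "->", tokenizations(cluster))
    # "boooks," -> [['boooks,'], ['boooks', ',']]

Both tokenizations of "boooks," become edges in the lattice; the pruning step described in the next paragraphs decides between them.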
<Paragraph position="12">Because tokenization is nondeterministic and does not involve lexical access, it will produce many possible tokens that cannot be further analysed. If sentence [1] above were processed, with/avoid would be one such token. It is important that analyses are found for as many tokens and token sequences as possible, but that error recovery, especially if it involves user interaction, is not attempted unless really necessary. More generally, the system must decide which techniques to apply to which problem tokens, and how the results of doing so should be combined.</Paragraph>

<Paragraph position="13">CLARE's token segmentation phase therefore attempts to find analyses for all the single tokens in the lattice, and for any special forms, which may include spaces and therefore span multiple tokens. Next, a series of recovery methods, which may be augmented or re-ordered by the application developer, are applied. Global methods apply to the lattice as a whole, and are intended to modify its contents or create required lexicon entries on a scale larger than the individual token. Local methods apply only to single still-unanalysed tokens, and may either supply analyses for them or alter them to other tokens. The default methods, all of which may be switched on or off using system commands, supply facilities for inferring entries through access to an external machine-readable dictionary; for defining sequences of capitalized tokens as proper names; for spelling correction (described in detail in the next section); and for interacting with the user, who may suggest a replacement word or phrase or enter the VEX lexical acquisition subsystem (Carter, 1989) to create the required entries.</Paragraph>

<Paragraph position="14">After a method has been applied, the lattice is, if possible, pruned: edges labelled by unanalysed tokens are provisionally removed, as are other edges and vertices that then do not lie on a complete path. If pruning succeeds (i.e. if at least one problem-free path remains) then token analysis is deemed to have succeeded, and unanalysed tokens (such as with/avoid) are forgotten; any remaining global methods are invoked, because they may provide analyses for token sequences, but remaining local ones are not. If full pruning does not succeed, any subpath in the lattice containing more unrecognized tokens than an alternative subpath is eliminated. Subpaths containing tokens with non-alphabetic characters are penalized more heavily; this ensures that if the cluster "boooks," is input, the token sequence "boooks ," (in which "boooks" is an unrecognized token and "," is a comma) is preferred to the single token "boooks," (where the comma is part of the putative lexical item). The next method is then applied. (Footnote 2: In fact, for completeness, CLARE allows the application of two or more methods in tandem and will combine the results without any intermediate pruning. This option would be useful if, in a given application, two sources of knowledge were deemed to be about equally reliable in their predictions.)</Paragraph>
</Section>
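The pruning preference just described can be made concrete with a short sketch. The penalty weights and all names below are assumptions invented for the illustration, not values taken from CLARE; the logic simply keeps problem-free paths when they exist and otherwise prefers subpaths with fewer unrecognized tokens, penalizing non-alphabetic ones more heavily.

    def path_penalty(path, analysed):
        """Penalty for one token sequence: each unrecognized token
        costs 1, or 2 if it contains non-alphabetic characters."""
        penalty = 0
        for token in path:
            if token not in analysed:
                penalty += 2 if not token.isalpha() else 1
        return penalty

    def prune(paths, analysed):
        """Keep problem-free paths if any exist; otherwise keep only
        the least-penalized alternatives."""
        best = min(path_penalty(p, analysed) for p in paths)
        return [p for p in paths if path_penalty(p, analysed) == best]

    analysed = {","}                        # toy lexicon: only "," is known
    paths = [["boooks", ","], ["boooks,"]]  # two tokenizations of "boooks,"
    print(prune(paths, analysed))           # [['boooks', ',']] is preferred

As in the paper's example, the sequence with the comma split off scores better (one unrecognized alphabetic token) than the single token "boooks," (one unrecognized token containing a non-alphabetic character).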
<Section position="6" start_page="161" end_page="162" type="metho">
<SectionTitle>4 SEGMENTATION AND SPELLING CORRECTION</SectionTitle>

<Paragraph position="0">A fairly simple affix-stripping approach to token segmentation is adopted in CLARE because inflectional morphological changes in English tend not to be complex enough to warrant more powerful, and potentially less efficient, treatments such as two-level morphology (Koskenniemi, 1983). Derivational morphological relationships typically involve semantic peculiarities as well, necessitating the definition of derived words in the lexicon in their own right.</Paragraph>

<Paragraph position="2">The rules for dividing clusters into tokens have the same form as those for segmenting tokens into morphemes, and are processed by the same mechanism. Thus ",", like, say, "ed", is defined as a suffix, but one that is treated by the grammar as a separate word rather than a bound morpheme. Rules for punctuation characters are very simple because no spelling changes are ever involved.</Paragraph>

<Paragraph position="3">However, the possessive ending "'s" is treated as a separate word in the CLARE grammar to allow the correct analysis of phrases such as "the man in the corner's wife", and spelling changes can be involved here. Like segmentation, tokenization can yield multiple results, mainly because there is no reason for a complex cluster like Mr. or King's not also to be defined as a lexical item.</Paragraph>

<Paragraph position="4">One major advantage of the simplicity of the affix-stripping mechanism is that spelling correction can be interleaved directly with it.</Paragraph>

<Paragraph position="5">Root forms in the lexicon are represented in a discrimination net for efficient access (cf. Emirkanian and Bouchard, 1988). When the spelling corrector is called to suggest possible corrections for a word, the number of simple errors (of deletion, insertion, substitution and transposition; e.g. Pollock and Zamora, 1984) to assume is given. Normal segmentation is just the special case of this with the number of errors set to zero. The mechanism nondeterministically removes affixes from each end of the word, postulating errors if appropriate, and then looks up the resulting string in the discrimination net, again considering the possibility of error. (Footnote 3: This is the reverse of Veronis' (1988) algorithm, where roots are matched before affixes. However, it seems easier and more efficient to match affixes first, because then the hypothesized root can be looked up without having to allow for any spelling changes; and if both prefixes and suffixes are to be handled, as they are in CLARE, there is no obvious single starting point for searching for the root first.) Interleaving correction with segmentation like this promotes efficiency in the following way. As in most other correctors, only up to two simple errors are considered along a given search path. Therefore, either the affix-stripping phase or the lookup phase is fairly quick and produces a fairly small number of results, and so the two do not combine to slow processing down. Another beneficial consequence of the interleaving is that no special treatment is required for the otherwise awkward case where errors overlap morpheme boundaries; thus desigend is corrected to designed as easily as deisgned or designde are.</Paragraph>
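As an illustration of the mechanism just described, the sketch below performs error-tolerant lookup in a discrimination net, here simplified to a plain character trie. The tiny root and suffix lists, the precompiled root-plus-suffix lexicon (the real mechanism strips affixes nondeterministically rather than expanding the lexicon), and all names are assumptions made for the example; the error model, however, is the one named above: deletion, insertion, substitution and transposition, with a bounded error count.

    ROOTS = ("design", "work", "book", "never", "with")
    SUFFIXES = ("", "ed", "s")

    def build_trie(words):
        trie = {}
        for w in words:
            node = trie
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True                     # end-of-word marker
        return trie

    # Toy lexicon pairing every root with every suffix; a real system
    # would pair each root only with the affixes it licenses.
    TRIE = build_trie(root + suf for root in ROOTS for suf in SUFFIXES)

    def lookup(node, s, errors, prefix=""):
        """Yield lexicon entries within `errors` simple errors of `s`."""
        if not s and "$" in node:
            yield prefix
        if s and s[0] in node:                   # next character matches
            yield from lookup(node[s[0]], s[1:], errors, prefix + s[0])
        if errors == 0:
            return
        for ch, child in node.items():
            if ch == "$":
                continue
            if s:                                # s[0] typed for intended ch
                yield from lookup(child, s[1:], errors - 1, prefix + ch)
            yield from lookup(child, s, errors - 1, prefix + ch)  # ch dropped
        if s:                                    # s[0] spuriously inserted
            yield from lookup(node, s[1:], errors - 1, prefix)
        if len(s) > 1:                           # adjacent transposition
            yield from lookup(node, s[1] + s[0] + s[2:], errors - 1, prefix)

    def join_correct(tok1, tok2, errors=1):
        """Treat an inserted word-boundary space as one simple error and
        try the two tokens as a single word with the remaining budget."""
        return set(lookup(TRIE, tok1 + tok2, errors - 1))

    print(set(lookup(TRIE, "desigend", 1)))   # {'designed'}: the error
                                              # spans the design+ed boundary
    print(join_correct("nev", "er"))          # {'never'}: one space insertion

Because errors are postulated during the same search that walks the net, the boundary-spanning transposition in desigend needs no special treatment, which is the point made in the paragraph above; join_correct anticipates the space-error handling described in the next section.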
<Paragraph position="6">If one or more possible corrections to a token are found, they may either be presented to the user for selection or approval, or, if the number of them does not exceed a pre-set threshold, all be preserved as alternatives for disambiguation at the later syntactic or semantic stages. The lattice representation allows multiple-word corrections to be preserved along with single-word ones.</Paragraph>

<Paragraph position="7">It is generally recognized that spelling errors in typed input are of two kinds: competence errors, where the user does not know, or has forgotten, how to spell a word; and performance errors, where the wrong sequence of keys is hit. CLARE's correction mechanism is oriented towards the latter. Other work (e.g. Veronis, 1988; Emirkanian and Bouchard, 1988; van Berkel and De Smedt, 1988) emphasizes the former, often on the grounds that competence errors are both harder for the user to correct and tend to make a worse impression on a human reader. However, Emirkanian and Bouchard identify the many-to-one nature of French spelling-sound correspondence as responsible for the predominance of such errors in that language, which they say does not hold in English; and material typed to CLARE tends to be processed further (for database access, translation, etc.) rather than reproduced for potentially embarrassing human consumption. A performance-error approach also has the practical advantage of not depending on extensive linguistic knowledge; and many competence errors can be detected by a performance approach, especially if some straightforward adjustments (e.g. to prefer doubling to other kinds of letter insertion) are made to the algorithm.</Paragraph>

<Paragraph position="10">As well as coping quite easily with morpheme boundaries, CLARE's algorithm can also handle the insertion or deletion of word boundary spaces. For the token witha, CLARE postulates both with and with a as corrections, and (depending on the current switch settings) both may go into the lattice. The choice will only finally be made when a QLF is selected on sortal and other grounds after parsing and semantic analysis. For the token pair nev er, CLARE postulates the single correction never, because this involves assuming only one simple error (the insertion of a space) rather than two or more to "correct" each token individually. Multiple overlapping possibilities can also be handled; the input Th m n worked causes CLARE to transform the initial lattice th m n worked into a corrected lattice containing analyses of the words shown here:</Paragraph>

<Paragraph position="12">The edges labelled "them" and "man/men" are constructed first by the "global" spelling correction method, which looks for possible corrections across token boundaries. The edge for the token "m" is then removed because, given that it connects only to errorful tokens on both sides, it cannot form part of any potentially optimal path through the lattice.</Paragraph>

<Paragraph position="13">Corrections are, however, sought for "th" and "n" as single tokens when the local spelling correction method is invoked. The corrected lattice then undergoes syntactic and semantic processing, and QLFs for the sequences "the man worked" and "the men worked", but not for any sequence starting with "them" or "to", are produced.</Paragraph>
</Section>
</Paper>