<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1508"> <Title>Lexical Resource Reconciliation in the Xerox Linguistic Environment</Title> <Section position="4" start_page="54" end_page="57" type="metho"> <SectionTitle> 3 XLE Morphological Processing </SectionTitle> <Paragraph position="0"> One obvious obstacle to incorporating externally developed morphological analyzers, even ones based on finite-state technology, is that they may be supplied in a variety of data-structure formats. We overcame this obstacle by implementing special software interpreters for the different transducer formats. But, as we learned in an evolutionary process, a more fundamental problem in integrating these components with syntactic processing is to reconcile the analyses they produce with the needs of the grammar.</Paragraph> <Paragraph position="1"> Externally-available morphological components are, at present, primarily aimed at relatively undemanding commercial applications such as information retrieval. As such, they may have mistakes or gaps that have gone unnoticed because they have no effect on the target applications. For example, function words are generally ignored in IR, so mistakes in those words might go undetected; but those words are crucial in syntactic processing. And even correct analyses generally deviate in some respects from the characterizations desired by the grammar writer.</Paragraph> <Paragraph position="2"> These mistakes, gaps, and mismatches are often not easy to correct at the source. It may not be practical to refer problems back to the supplier as they are uncovered, because of inherent time lags, economic feasibility, or other factors. 
But the grammar writer may not have the tools, permissions, or skills to modify the source specifications directly.</Paragraph> <Section position="1" start_page="55" end_page="55" type="sub_section"> <SectionTitle> 3.1 Basic Approach </SectionTitle> <Paragraph position="0"> It is often the case that repairs to an external transducer whose behavior is unsatisfactory can be described in terms of operations in the calculus of regular relations (Kaplan and Kay, 1994). Transducers that encode the necessary modifications can be combined with the given one by combinations of composition, union, concatenation, and other operators that preserve regularity.</Paragraph> <Paragraph position="1"> For example, suppose an externally created finite-state morphological analyzer for English maps surface forms such as &quot;email&quot; into stem+inflectional-morpheme sequences such as &quot;email +Nmass&quot;, but that certain other relatively recent usages (e.g.</Paragraph> <Paragraph position="2"> &quot;email&quot; as a verb) are not included. A more adequate transducer can be specified as External U Addition, the union of the external transducer with a locally-defined modification transducer. XLE provides a facility for constructing simple Addition transducers from lists of the desired input/output pairs:

email : email +Nsg
emails : email +Npl
email : email +Vpres

The relational composition operator is also quite useful, since it enables the output of the external transducer to be transformed into more suitable arrangements. For example, suppose the given transducer maps &quot;buy&quot; into &quot;buy +VBase&quot;, where +VBase subsumes both the present-tense and infinitival interpretations.
This is not a helpful result if these two interpretations are treated by separate entries in the LFG lexicon, but this problem can be repaired by the composition External o TagFst, where TagFst is a transducer specified as</Paragraph> <Paragraph position="4"> Finally, in cases where the output of the external analyzer is simply wrong, corrections can be developed again in a locally specified way and combined with a &quot;priority union&quot; operator that blocks the incorrect analyses. The priority union of regular relations A and B is defined as A Up B = A U [Id(~Dom(A)) o B], where ~Dom(A) is the complement of the domain of A. The relation A Up B contains all the pairs of strings in A together with all the pairs in B except those whose domain string is also in the domain of A. With this definition the expression Correction Up External describes a transducer that implements the desired overrides.</Paragraph> <Paragraph position="5"> In principle, these and other modifications to an externally supplied transducer can be built by using a full-scale regular expression compiler, such as the Xerox calculus described in (Karttunen et al., 1997). The calculus can evaluate these corrective expressions in an off-line computation to create a single effective transducer for use in syntactic processing. However, in practice it is not always possible to perform these calculations. The external transducer might be supplied in a form (e.g. highly compressed) not susceptible to combination. Or (depending on the modifications) the resultant transducer may grow too large to be useful or take too long to create.</Paragraph> <Paragraph position="6"> Therefore, our approach is to allow transducer-combining operations to be specified in a &quot;morphological configuration&quot; (morph-config) referenced from the XLE grammar configuration. The effects of the finite-state operations specified in the morph-config are simulated at run time, so that we obtain the desired behavior without performing the off-line calculations.
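As a rough illustration of the semantics of these three combining operations (a sketch only, not the actual XLE machinery), each analyzer can be modeled as a table from an input string to a set of output strings; the helper names below are our own:

```python
# Sketch of the three combining operations over analyzers modeled as
# dicts from an input string to a set of output strings. Real XLE
# components are finite-state transducers; this only illustrates the
# intended run-time semantics.

def union_lookup(form, a, b):
    """External U Addition: combine the outputs of both analyzers."""
    return a.get(form, set()) | b.get(form, set())

def compose_lookup(form, a, b):
    """External o TagFst: feed each output of the first analyzer to
    the second analyzer as input."""
    result = set()
    for mid in a.get(form, set()):
        result.update(b.get(mid, set()))
    return result

def priority_union_lookup(form, a, b):
    """Correction Up External: use the second analyzer only when the
    first yields no output for this input."""
    out = a.get(form, set())
    return out if out else b.get(form, set())
```

With the Addition table from the email example above, union_lookup delivers both the external +Nmass analysis and the locally added +Nsg and +Vpres analyses for &quot;email&quot;.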
In the case of union, the analysis is the combination of results obtained by passing the input string through both transducers. For composition, the output of the first transducer is passed as input to the second. For priority union, the first transducer is used in an attempt to produce an output for a given input; the result from the second transducer is used only if there is no output from the first.</Paragraph> <Paragraph position="7"> Specifying the transducer-combining operations in the XLE morph-config rather than relying on a script for an off-line computation has another important advantage: it gives the grammar writer a single place to look to understand exactly what behavior to expect from the morphological analyzer. It removes a serious audit-trail problem that would otherwise arise in large-scale grammar development.</Paragraph> </Section> <Section position="2" start_page="55" end_page="55" type="sub_section"> <SectionTitle> 3.2 The Role of Tokenization </SectionTitle> <Paragraph position="0"> The role of tokenization in a production-level parser is more than a matter of isolating substrings bounded by blanks or punctuation or, in generation, reinserting blanks in the appropriate places. The tokenizer must also undo various conventions of typographical expression to put the input into the canonical form expected by the morphological analyzer. Thus it should include the following capabilities in addition to substring isolation: * Normalizing: removing extraneous white-space from the string.</Paragraph> <Paragraph position="1"> * Editing: making prespecified types of modifications to the string, e.g., removing annotations or SGML tags in an annotated corpus.</Paragraph> <Paragraph position="2"> * Accidental capitalization handling: analyzing capitalized words, at least in certain contexts such as sentence-initial, as, alternatively, their uncapitalized counterparts. * Contraction splitting: dividing contracted forms such as &quot;du&quot; and &quot;John's&quot; into sequences of tokens that would typically appear under different syntactic nodes.
That is, the elements of &quot;du&quot; fall under both PREP and NP, and those of &quot;John's&quot; within an NP and a POSS attached to the NP. Separating these in the tokenizing process avoids what would otherwise be unnecessary grammatical complications.</Paragraph> <Paragraph position="3"> * Compound word isolation: determining which space-separated words should be considered as single tokens for purposes of morphological analysis and parsing, at least as an alternative. This function can range from simply shielding blank-containing words from separation, e.g., &quot;a priori&quot;, &quot;United States&quot;, to recognizing and similarly shielding highly structured sequences such as date and time expressions (Karttunen et al., 1997). Because a tokenizing transducer is not only language-specific but often specific to a particular application and perhaps also to grammar-writer preference, it may be necessary to build or modify the tokenizing transducer as part of the grammar-writing effort.</Paragraph> <Paragraph position="4"> Logically, the relationship between the effective finite-state tokenizer, which carries out all of the above functions, and the morphological analyzer can be defined as: tokenizer o [morph analyzer @]* That is, the tokenizer, which inserts token boundaries (@) between tokens, is composed with a cyclic transducer that expects a @ after each word. However, the feasibility considerations applying to the off-line compilation of morphological transducers apply as well to the pre-compilation of their composition with the effective tokenizer. So the XLE morph-config file provides for the specification of a separate tokenizer, and the cycle and composition in the expression above are simulated at run-time.</Paragraph> <Paragraph position="5"> Further, it is advantageous to allow the effective tokenizer to also be specified in a modular way, as regular combinations of separate transducers interpreted by run-time simulation.
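The run-time simulation of the cycle and composition can be sketched as follows; the tokenize function and the dict-based morph table are illustrative stand-ins for the actual finite-state components, and &quot;@&quot; stands in for the token-boundary symbol:

```python
# Sketch: simulating  tokenizer o [morph analyzer @]*  at run time.
# Rather than precompiling the composition, the tokenizer output is
# split at token boundaries and the morphological analyzer is applied
# to each token in turn.

TB = "@"  # token-boundary symbol

def tokenize(text):
    """Stand-in tokenizer: normalizes whitespace and inserts a token
    boundary after each token (real XLE tokenizers are transducers
    with the richer capabilities listed above)."""
    return "".join(tok + TB for tok in text.split())

def analyze_sentence(text, morph):
    """Simulate the cyclic composition: one morph lookup per token."""
    tokens = [t for t in tokenize(text).split(TB) if t]
    return [(t, morph.get(t, set())) for t in tokens]
```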
The need for modularity in this case is based both on potential transducer size, as combining some of the tokenizer functions can lead to unacceptably large transducers, and on the desirability of using the same morph-config specifications for both parsing and generation. Most of the tokenizer functions are appropriate to both of these processes, but some, such as white-space normalization, are not. Our modular specification of the effective tokenizer allows the grammar writer to mark those components that are used in only one of the processes.</Paragraph> </Section> <Section position="3" start_page="55" end_page="57" type="sub_section"> <SectionTitle> 3.3 XLE Morph-Config </SectionTitle> <Paragraph position="0"> The current XLE morph-config provides for these capabilities in a syntactically-sugared form so that a grammar writer not interested in the finite-state calculus can nevertheless combine externally supplied transducers with locally-developed ones, and can easily comprehend the expected behavior of the effective transducer.</Paragraph> <Paragraph position="1"> The general structure of the morph-config is illustrated by the following: The transducers named in the TOKENIZE section are applied in sequence to each input string, with the result being the same as that of applying their composition. However, those prefixed by P! or G! are applied only in parsing or generation, respectively. The individual output tokens are submitted to the transducers specified in the ANALYZE sections. There may be two such sections. The ANALYZE USEFIRST section is borrowed directly from the Rank Xerox transducer &quot;lookup&quot; facility (Karttunen et al., 1997). Each line in this section specifies an effective transducer equivalent to the result of composing all the transducers mentioned on that line.
The effective transducer resulting from the entire section is equivalent to the combination, by priority union, of the effective transducers specified by the individual lines.</Paragraph> <Paragraph position="2"> So, in the above example of an ANALYZE USEFIRST section, the large main-morph, assumed to be externally supplied, handles most inputs, but its incorrect analyses are blocked by the preemptive results of the morph-override transducer. And if neither the override transducer nor the main transducer obtains an analysis during parsing, the transduction specified as the composition of an accent fixer (e.g., for correcting erroneous accents) with the main transducer is applied. The accent fixer is not used in generation, so that only properly accented strings are produced. In the ANALYZE USEALL section, transducers on a single line are again interpreted as composed, but the resultant transducers from separate lines are understood as combined by simple union with each other and with the (single) effective ANALYZE USEFIRST transducer. In this way the grammar writer can specify one or more small transducers delivering alternative analyses where the results of the external analyzer are correct, but additional analyses are needed.</Paragraph> <Paragraph position="3"> The meaning of all the specifications in a morph-config file can be represented as an expression in the finite-state calculus that describes a single effective transducer. This transducer maps from the characters of an input sentence to a finite-state machine that represents all possible sequences of morpheme/tag combinations. The network structure of that machine is isomorphic to the initial parse-chart.
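The interpretation of the two ANALYZE sections can be sketched with the same dict-based abstraction used above (the function names and the list-of-lines representation of a config are our own, not XLE's):

```python
# Sketch: interpreting the ANALYZE sections of a morph-config at run
# time. Each config line is a list of analyzers interpreted as
# composed; USEFIRST lines combine by priority union (the first line
# producing output wins), USEALL lines by plain union with the
# USEFIRST result. Analyzers are dicts from input to a set of outputs.

def line_lookup(form, line):
    """Compose the analyzers mentioned on one config line."""
    outputs = {form}
    for analyzer in line:
        nxt = set()
        for s in outputs:
            nxt.update(analyzer.get(s, set()))
        outputs = nxt
    return outputs

def analyze(form, usefirst_lines, useall_lines):
    result = set()
    for line in usefirst_lines:   # priority union over lines
        result = line_lookup(form, line)
        if result:
            break
    for line in useall_lines:     # plain union over lines
        result = result | line_lookup(form, line)
    return result
```

For instance, a USEFIRST section of three lines, [[override], [main], [fixer, main]], reproduces the override/main/accent-fixer behavior described in the text.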
For our simple morph-config example, the effective parsing transducer is</Paragraph> </Section> </Section> <Section position="5" start_page="57" end_page="58" type="metho"> <SectionTitle> 4 XLE Lexicon Structures </SectionTitle> <Paragraph position="0"> The lexical extensions XLE makes to the GWB database setup are also aimed at obtaining broad coverage via the use of independently developed resources. External lexical resources may be derived from machine-readable dictionaries or large corpora by means of corpus analysis tools that can automatically produce LFG lexical entries (for example, the tools described by Eckle and Heid, 1996). However, the results of using such repositories and tools are relatively error-prone and cannot be used without correction. In this section we describe the conventions for combining lexical information from different sources.</Paragraph> <Paragraph position="1"> As in GWB, an entry in the XLE lexicon contains one or more subentries associated with different parts of speech, and separate entries for stems and affixes, e.g., These entries specify the syntactic categories for the different senses of the morphemes together with the LFG constraints that are appropriate to each sense. The syntactic category for a sense, the one that matches the preterminal nodes of the phrase-structure grammar, is created by packing together two components that are given in the entry, the major category indicator (N and V above) and a category-modifier (BASE and SFX). The noun and verb senses of &quot;cook&quot; are thus categorized as N-BASE and V-BASE, and the category of the plural morpheme is N-SFX.
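The packing step just described can be sketched trivially (pack_category is an illustrative helper, not an XLE name):

```python
# Sketch: packing the major category indicator with the category
# modifier from a lexical entry to form the sublexical category that
# matches the preterminal nodes of the phrase-structure grammar.

def pack_category(major, modifier):
    return major + "-" + modifier

# The senses discussed in the text: noun and verb "cook", plural suffix.
senses = [("N", "BASE"), ("V", "BASE"), ("N", "SFX")]
packed = [pack_category(major, mod) for major, mod in senses]
```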
These categories are distinct from the N and V that are normally referenced by higher-level syntactic rules, and this distinction permits the grammar to contain a set of &quot;sublexical&quot; rules that determine what morphotactic combinations are allowed and how the constraints for an allowable combination are composed from the constraints of the individual morphemes. The sublexical rule for combining English nouns with their inflectional suffixes is N --> N-BASE N-SFX. Here the constraints for the N are simply the conjunction of the constraints for the base and suffix, and this rule represents the interpretation that is built into the GWB system. By distinguishing the morphological categories from the syntactic ones and providing the full power of LFG rules to describe morphological compositions, the XLE system allows for much richer and more complicated patterns of word formation.</Paragraph> <Paragraph position="2"> Also as in GWB, the lexicon for a grammar can be specified in the configuration by an ordered list of identifiers for lexical entry collections such as:</Paragraph> </Section> <Section position="6" start_page="58" end_page="58" type="metho"> <SectionTitle> LEXENTRIES (CORPUS ENGLISH) (CORRECTIONS ENGLISH). </SectionTitle> <Paragraph position="0"> To find the effective definition for a given headword, the identified lexicons are scanned in order and the last entry for the headword is used. Thus the order of lexicon identifiers in an XLE configuration allows hand-created accurate entries to be identified later in the list and override all earlier (errorful) entries.</Paragraph> <Paragraph position="1"> This is a practical approach to correction; the alternative approach of manually correcting erroneous entries in situ requires the edits to be redone whenever new versions of the external lexicon are created.</Paragraph> <Paragraph position="2"> But more can be done, as discussed in the next section.
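The last-entry-wins scan over the ordered lexicon list can be sketched as follows (lexicons are modeled here as dicts from headword to an entry string; effective_entry is our own name):

```python
# Sketch: resolving a headword against an ordered list of lexicons.
# The lexicons are scanned in configuration order and the LAST entry
# found for the headword is used, so accurate hand-created lexicons
# listed later override earlier, error-prone ones.

def effective_entry(headword, lexicons):
    entry = None
    for lexicon in lexicons:   # configuration order
        if headword in lexicon:
            entry = lexicon[headword]
    return entry
```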
</Paragraph> <Section position="1" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 4.1 Lexicon Edit Entries </SectionTitle> <Paragraph position="0"> We found that an erroneous entry in an externally provided lexicon often contains many subentries that are completely acceptable and should not be thrown away. However, the complete overrides afforded by the configuration priority order do not allow these desirable subentries to be preserved.</Paragraph> <Paragraph position="1"> For this reason, we defined a new kind of lexical entry, called an &quot;edit entry&quot;, that allows more finely tuned integration of multiple entries for a headword. The category-specific subentries of an edit entry are prefixed by operators specifying their effects on the subentries collected from earlier definitions in the priority sequence. For example, the external lexicon might supply the following preposition and directional-adverb subentries for the word &quot;down&quot;: down P BASE @PREP;</Paragraph> </Section> </Section> <Section position="7" start_page="58" end_page="58" type="metho"> <SectionTitle> ADV BASE @DIRADV. </SectionTitle> <Paragraph position="0"> where @PREP and @DIRADV invoke the LFG templates that carry the constraints common to all prepositions and directional adverbs. The following edit entry, specified later in the lexicon sequence, can be used to add the less likely transitive verb reading (e.g. &quot;He downed the beer&quot;): down +V BASE @TRANS; ETC.</Paragraph> <Paragraph position="1"> The + operator indicates that this V subentry is to be disjoined with any previous verb subentries (none in this case). The ETC flag specifies that previous subentries with categories not mentioned explicitly in the edit entry (P and ADV in this case) are to be retained.</Paragraph> <Paragraph position="2"> We allow three other operators on a category C besides +.
!C indicates that all previous category-C subentries for the headword should be replaced by the new one. -C prefixes otherwise empty subentries and specifies that all previous category-C subentries for the headword should be deleted. For example, if the above verbal meaning of &quot;down&quot; actually appeared in the external lexicon and was not desired, it could be deleted by an edit entry: down -V; ETC.</Paragraph> <Paragraph position="3"> Finally, =C indicates that all previous category-C subentries should be retained. This is useful when the flag ONLY appears instead of ETC. ONLY sets the default that the subentries for all categories not mentioned in the current entry are to be discarded; =C can be used to override that default just for the category C.</Paragraph> <Paragraph position="4"> Combinations of these operators and flags allow quite precise control over the assembly of a final effective entry from information coming from different sources. To give one final example, if the constraints provided for the adverbial reading were insufficiently refined, the correcting entry might be: down +V BASE @TRANS;</Paragraph> </Section> <Section position="8" start_page="58" end_page="59" type="metho"> <SectionTitle> !ADV BASE @DIRADV (^ ADV-TYPE) = VPADV-FINAL; </SectionTitle> <Paragraph position="0"> ETC.</Paragraph> <Paragraph position="1"> This adds the verbal entry as before, but replaces the previous adverb entry with a new set of constraints. Edit entries have been especially valuable in the German branch of the PARGRAM effort, which uses lexicons automatically derived from many sources. Subentries in the sequence of automatic lexicons are generally prefixed by + so that each lexicon can make its own contributions.
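The operator and flag semantics can be sketched as a merge over collected subentries; the dict-of-lists representation and the apply_edit_entry name are illustrative assumptions, not XLE internals:

```python
# Sketch: merging an edit entry into subentries collected from earlier
# definitions in the priority sequence. Subentries are modeled as a
# dict from category to a list of constraint strings; an edit is an
# (operator, category, constraints) triple.

def apply_edit_entry(previous, edits, flag="ETC"):
    if flag == "ONLY":
        result = {}          # discard subentries for unmentioned categories
    else:                    # ETC: retain subentries for unmentioned categories
        result = {cat: list(subs) for cat, subs in previous.items()}
    for op, cat, constraints in edits:
        if op == "+":        # disjoin with any previous subentries
            result[cat] = previous.get(cat, []) + [constraints]
        elif op == "!":      # replace all previous subentries
            result[cat] = [constraints]
        elif op == "-":      # delete all previous subentries
            result.pop(cat, None)
        elif op == "=":      # retain previous subentries (useful with ONLY)
            if cat in previous:
                result[cat] = list(previous[cat])
    return result
```

With this model, the final example above is the edit list [("+", "V", "@TRANS"), ("!", "ADV", ...)] under the ETC flag: the verbal subentry is added, the adverb subentry is replaced, and the prepositional subentry is retained.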
Then manually coded lexicons are specified with highest precedence to make the necessary final adjustments.</Paragraph> <Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 4.2 Default definitions and unknown words </SectionTitle> <Paragraph position="0"> Morphological analysis coupled with sublexical rules determines which categories a word belongs to and what suffixes it can take. Subentries in the LFG lexicons specify additional syntactic properties which, for many words, are not predictable from their morphological decompositions. There are large sets of words in each category, however, that have exactly the same syntactic properties; most common nouns, for example, do not take arguments. XLE provides for default specifications that permit many redundant individual entries to be removed from all lexicons. The lexical entry for the special headword -Lunknown contains subentries giving the default syntactic properties for each category. The entry -Lunknown N BASE @CN; V BASE @TRANS.</Paragraph> <Paragraph position="1"> indicates that the default properties of morphological nouns are provided by the common-noun template @CN and that verbs are transitive by default. -Lunknown is taken as the definition for each lower-case word at the beginning of the scan through the ordered list of lexicons, and these subentries will prevail unless they are overridden by more specific information about the word. Another special headword -LUnknown gives the default definition for upper-case words (e.g. for proper nouns in English).</Paragraph> <Paragraph position="2"> To illustrate, suppose that the morphological analyzer produces the following analysis doors : door +Npl but that there is no explicit lexical entry for &quot;door&quot;.
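Before continuing the example, the case-sensitive defaulting just described can be sketched as follows; the upper-case default entry shown here (@PROPERNOUN) is hypothetical, since the text gives only the lower-case -Lunknown entry:

```python
# Sketch: selecting the initial default definition for a word by case
# (-Lunknown for lower-case words, -LUnknown for upper-case ones),
# then letting any explicit entry found in the ordered lexicon scan
# override it. The -LUnknown entry body is an assumed placeholder.

DEFAULTS = {
    "-Lunknown": "N BASE @CN; V BASE @TRANS",   # from the text
    "-LUnknown": "N BASE @PROPERNOUN",          # hypothetical default
}

def definition_for(word, lexicons):
    key = "-LUnknown" if word[:1].isupper() else "-Lunknown"
    entry = DEFAULTS[key]
    for lexicon in lexicons:   # ordered scan; later entries prevail
        if word in lexicon:
            entry = lexicon[word]
    return entry
```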
Edges for both N-BASE and V-BASE, with constraints drawn from the -Lunknown entry, would be added to the initial chart, along with edges for the tags.</Paragraph> <Paragraph position="3"> But only the sublexical rule N --> N-BASE N-SFX would succeed, because the morphological analyzer supplied the +Npl tag and no verbal inflection. Thus there is no need to have a separate entry for &quot;door&quot; in the lexicon. This feature contributes to the robustness of XLE processing since it will produce the most common definition for words known to the morphology but not known to the lexicon.</Paragraph> </Section> </Section> </Paper>