<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1508">
  <Title>Lexical Resource Reconciliation in the Xerox Linguistic Environment</Title>
  <Section position="3" start_page="0" end_page="54" type="intro">
    <SectionTitle>
2 The GWB Data Base
</SectionTitle>
    <Paragraph position="0"> GWB provides a computational environment tailored especially for defining and testing grammars in the LFG formalism. Comprehensive editing facilities internal to the environment are used to construct and modify a data base of grammar elements of various types: morphologicalrules, lexical entries, syntactic rules, and &amp;quot;templates&amp;quot; allowing named abbreviations for combinations of constraints. (See Kaplan and Maxwell, 1996; Kaplan and Bresnan, 1982; and Kaplan, 1995 for descriptions of the LFG formalism.) Separate &amp;quot;configuration&amp;quot; specifications indicate how to select and assemble collections of these elements to make up a complete grammar, and alternative configurations make it easy to experiment with different linguistic analyses.</Paragraph>
    <Paragraph position="1"> This paper focuses on the lexical mapping process, that is, the overall process of translating between the characters in an input string and the initial edges of the parse-chart. We divide this process into the typical stages of tokenization, morphological analysis, and LFG lexicon lookup. In GWB tokenizing is accomplished with a finite-state transducer compiled from a few simple rules according to the methods  described by (Kaplan and Kay, 1994). It tokenizes the input string by inserting explicit token boundary symbols at appropriate character positions. This process can produce multiple outputs because of uncertainties in the interpretation of punctuation such as spaces and periods. For example, &amp;quot;I like Jan.&amp;quot; results in two alternatives (&amp;quot;I@like@Jan@.@&amp;quot; and &amp;quot;I@like@Jan.@.@&amp;quot;) because the period in &amp;quot;Jan.&amp;quot; could optionally mark an abbreviation as well as a sentence end.</Paragraph>
    <Paragraph position="2"> Morphological analysis is also implemented as a finite-state transducer again compiled from a set of rules. These rules are limited to describing only simple suffixing and inflectional morphology. The morphological transducer is arranged to apply to individual tokens produced by the tokenizer, not to strings of tokens. The result of applying the morphological rules to a token is a stem and one or more inflectional tags, each of which is the heading for an entry in the LFG lexicon. Morphological ambiguity can lead to alternative analyses for a single token, so this stage can add further possibilities to the alternatives coming from the tokenizer. The token &amp;quot;cooks&amp;quot; can be analyzed as &amp;quot;cook +NPL&amp;quot; or &amp;quot;cook +V3SG&amp;quot;, for instance.</Paragraph>
    <Paragraph position="3"> In the final phase of GWB lexical mapping, these stem-tag sequences are looked up in the LFG lexicon to discover the syntactic category (N, V, etc.) and constraints (e.g. (1&amp;quot; NUM)=PL) to be placed on a single edge in the initial parse chart. The category of that edge is determined by the particular combination of stems and tags, and the corresponding edge constraints are formed by conjoining the constraints found in the stem/tag lexical entries. Because of the ambiguities in tokenization, morphological analysis and also lexical lookup, the initial chart is a network rather than a simple sequence.</Paragraph>
    <Paragraph position="4"> The grammar writer enters morphological rules, syntactic rules, and lexical entries into a database.</Paragraph>
    <Paragraph position="5"> These are grouped by type into named collections.</Paragraph>
    <Paragraph position="6"> The collections may overlap in content in that different syntactic rule collections may contain alternative expansions for a particular category and different lexical collections may contain alternative definitions for a particular headword. A configuration contains an ordered list of collection names to indicate which alternatives to include in the active grammar.</Paragraph>
    <Paragraph position="7"> This arrangement provides considerable support for experimentation. The grammar writer can investigate alternative hypotheses by switching among configurations with different inclusion lists. Also, the inclusion list order is significant, with collections mentioned later in the list having higher precedence than ones mentioned earlier. If a rule for the same syntactic category appears m more than one included rule collection, or an entry for the same headword appears in more than one included lexical collection, the instance from the collection of highest precedence is the one included in the grammar.</Paragraph>
    <Paragraph position="8"> Thus the grammar writer can tentatively replace a few rules or lexical entries by placing some very small collections containing the replacements later in the configuration list.</Paragraph>
    <Paragraph position="9"> We constructed XLE around the same databaseplus-configuration model but adapted it to operate in the C/Unix world and to meet an additional set of user requirements. GWB is implemented in a residential Lisp system where rules and definitions on text files are &amp;quot;loaded&amp;quot; into a memory-based database and then selected and manipulated. In C/Unix we treat the files themselves as the analog of the GWB database. Thus, the XLE user executes either a &amp;quot;create-parser&amp;quot; or &amp;quot;create-generator&amp;quot; command to select a file containing one or more configurations and to select one of those configurations to specify the current grammar. The selected configuration, in turn, names a list of files comprising the data base, and identifies the elements in those files to be used in the grammar.</Paragraph>
    <Paragraph position="10"> This arrangement still supports alternative versions of lexical entries and rules, but the purpose is not just to permit easy and rapid exploration of these alternatives. The XLE database facilities also enable linguistic specifications from different sources and with different degrees of quality to be combined together in an orderly and coherent way. For XLE the goal is to produce efficient, robust, and broad coverage processors for parsing and generation, and this requires that we make use of large-scale, independently developed morphological analyzers and lexicons. Such comprehensive, well-engineered components exist for many languages, and can relieve the grammar developer of much of the effort and expense of accounting again for those aspects of language processing.</Paragraph>
  </Section>
class="xml-element"></Paper>