<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1506">
  <Title>Adapting Existing Grammars: The XLE Experience</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
@INT-BODY @INT-PUNCT
@HEADER .
</SectionTitle>
    <Paragraph position="0"> In the STANDARD grammar, the DECL-PUNCT macro is defined as in (12a). However, this must be modified in the EUREKA grammar because the punctuation is much sloppier and often does not occur at all; the EUREKA version is shown in (12b).</Paragraph>
    <Paragraph position="2"> The modular specifications that macros and templates provide allow rule behavior to be modified without having to copy the parts of the rule that do not change.</Paragraph>
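The macro mechanism can be modeled as simple substitution. The following is an illustrative Python sketch, not XLE notation: rule right-hand sides are lists of symbols, and a symbol beginning with @ is a macro call that is replaced by the macro's body, so redefining a macro in a higher-priority file changes every rule that calls it without copying those rules.

```python
def expand(rhs, macros):
    """Splice macro bodies into a rule RHS, as XLE '@' calls are.

    `rhs` is a list of symbols; `macros` maps macro names to their bodies.
    """
    out = []
    for sym in rhs:
        if sym.startswith("@"):
            out.extend(macros[sym[1:]])  # replace the call by the body
        else:
            out.append(sym)
    return out
```

In this model, swapping in the EUREKA definition of DECL-PUNCT changes the expansion of every rule containing @DECL-PUNCT while the rules themselves stay untouched.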
    <Paragraph position="3"> XLE also has a mechanism for systematically modifying the behavior of all rules: the METARULEMACRO. For example, in order to parse labeled bracketed input, as in (2b), the WSJ grammar was altered so that constituents could optionally be surrounded by the appropriately labeled brackets. The METARULEMACRO is applied to each rule in the grammar and produces as output a modified version of that rule. This is used in the STANDARD grammar for coordination and to allow quote marks to surround any constituent. The METARULEMACRO is redefined for the WSJ to add the labeled bracketing possibilities for each rule, as shown in (13).</Paragraph>
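The METARULEMACRO mechanism can be sketched as follows, again as an invented Python model rather than XLE notation: a grammar maps categories to rule expansions, and the metarule is applied to each rule, returning a modified expansion. Following the description of (13) and the structure in (14), the WSJ version adds a disjunct in which the constituent is surrounded by LSB, a matching LABEL, and RSB.

```python
def wsj_metarulemacro(cat, basecat, rhs):
    """Return a disjunction: the original RHS, or a label-bracketed CAT.

    Disjunction is modeled as a list of alternative expansions.
    """
    return [rhs, ["LSB", f"LABEL[{basecat}]", cat, "RSB"]]

def apply_metarule(grammar, metarule):
    """Apply the metarule to each rule in the grammar, as XLE does."""
    return {
        cat: metarule(cat, cat.split("[", 1)[0], rhs)
        for cat, rhs in grammar.items()
    }
```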
    <Paragraph position="4">  copy of STANDARD surrounding quote .</Paragraph>
    <Paragraph position="5"> The CAT, BASECAT, and RHS are arguments to the METARULEMACRO that are instantiated to different values for each rule. RHS is instantiated to the right-hand side of the rule, i.e., the rule expansion. CAT and BASECAT are two ways of representing the left-hand side of the rule. For simple categories the CAT and BASECAT are the same (e.g. NP for the NP rule). XLE also allows for complex category symbols to specialize the expansion of particular categories in particular contexts. For example, the VP rule is parameterized for the form of its complement and its own form, so that VP[perf,fin] is one of the complex VP categories. When the METARULEMACRO applies to rules with complex left-side categories, CAT refers to the category including the parameters and the BASECAT refers to the category without the parameters. For the VP example, CAT is VP[perf,fin] and BASECAT is VP.</Paragraph>
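The CAT/BASECAT distinction amounts to stripping the parameters from a complex category symbol; a hypothetical helper makes this concrete:

```python
def parse_category(cat):
    """Return (basecat, parameters) for a possibly complex category.

    E.g. 'VP[perf,fin]' -> ('VP', ['perf', 'fin']); 'NP' -> ('NP', []).
    """
    if "[" in cat and cat.endswith("]"):
        base, params = cat[:-1].split("[", 1)
        return base, params.split(",")
    return cat, []
```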
    <Paragraph position="6"> In the definition in (13), LSB and RSB parse the brackets themselves, while the LABEL[BASECAT] parses the label in the bracketing and matches it to the label in the tree (NP in (2b)); the constituent itself is the CAT. Thus, a label-bracketed NP is assigned the structure in (14).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
(14) NP
LSB LABEL[NP] NP RSB
[ NP-SBJ Kirk Horse ]
</SectionTitle>
    <Paragraph position="0"> These examples illustrate how the prioritized redefinition of rules and macros has enabled us to incorporate the STANDARD rules in grammars that are tuned to the special properties of the EUREKA and WSJ corpora.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Lexical Entries
</SectionTitle>
      <Paragraph position="0"> Just as for rules, XLE's override conventions make it possible to: add new lexical items or new part-of-speech subentries for existing lexical items; delete lexical items; and modify lexical items. In addition to the basic priority overrides, XLE provides for &amp;quot;edit lexical entries&amp;quot; (Kaplan and Newman, 1997) that give finer control over the construction of the lexicon. Edit entries were introduced as a way of reconciling information from lexical databases of varying degrees of quality, but they are also helpful in tailoring a STANDARD lexicon to a specialized corpus. When working on specialized corpora, such as the Eureka corpus, modifications to the lexicon are extremely important for correctly handling technical terminology and eliminating word senses that are not appropriate for the domain.</Paragraph>
      <Paragraph position="1"> Higher-priority edit lexical entries provide operators that modify the definitions found in lower-priority entries. The operators can: add a subentry (+); delete a subentry (-); replace a subentry (!); or retain existing subentries (=). For example, the STANDARD grammar might have an entry for button as in (15).</Paragraph>
      <Paragraph position="3"> However, the EUREKA grammar might not need the V entry but might require a special partname N entry. Assuming that the EUREKA lexicons are given priority over the STANDARD lexicons, the entry in  (16) would accomplish this.</Paragraph>
      <Paragraph position="4"> (16) button V ;</Paragraph>
      <Paragraph position="6"> Note that the lexical entries in (15) and (16) end with ETC. This is also part of the edit lexical entry system. It indicates that other lower-priority definitions of that lexical item will be retained in addition to the new entries. For example, if in another EUREKA lexicon there was an adjective entry for button with ETC, the V, N, and A entries would all be used. The alternative to ETC is ONLY which indicates that only the new entry is to be used. In our button example, if an adjective entry was added with ONLY, the V and N entries would be removed, assuming that the adjective entry occurred in the highest priority lexicon.</Paragraph>
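The interaction of the edit operators with ETC and ONLY can be modeled with a toy resolver. The representation below is invented for illustration (real XLE entries also carry functional annotations, and '!' replacement is simplified here to adding the new subentry): lexicons are listed from highest to lowest priority, and each entry pairs a list of (operator, category) subentries with a terminator, ETC or ONLY.

```python
def resolve(word, lexicons):
    """Collect the effective subentry categories for `word`.

    `lexicons` is ordered highest priority first.  '-' deletions block the
    same category in lower-priority entries; ONLY cuts off everything below.
    """
    cats, deleted, done = [], set(), False
    for lex in lexicons:
        if done or word not in lex:
            continue
        ops, terminator = lex[word]
        for op, cat in ops:
            if op == "-":
                deleted.add(cat)
            elif op in ("+", "!", "="):
                if cat not in cats and cat not in deleted:
                    cats.append(cat)
        if terminator == "ONLY":
            done = True  # ignore all lower-priority entries
    return cats
```

Under this model, the EUREKA entry deletes the STANDARD verb reading and adds a part-name noun, and an ONLY adjective entry in a still-higher-priority lexicon suppresses both.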
      <Paragraph position="7"> This machinery provides a powerful tool for building specialized lexicons without having to alter the STANDARD lexicons.</Paragraph>
      <Paragraph position="8"> The EUREKA corpus contains a large number of names of copier parts. Due to their particular syntax and to post-syntactic processing requirements, a special lexical entry is added for each part name. In addition, the regular noun parse of these entries is deleted because whenever they occur in the corpus they are part names. A sample lexical entry is shown in (17); the ' is the escape character for the space.</Paragraph>
      <Paragraph position="10"> The first line in (17) states that separator finger can be a PART NAME and, when it is, calls a template PART-NAME that provides relevant information for the functional structure. The second line removes the N entry, if any, as signalled by the - before the category name.</Paragraph>
      <Paragraph position="11"> Because of the non-context free nature of Lexical Functional Grammar, it sometimes happens that extensions in one part of the grammar require a corresponding adjustment in other rules or lexical entries. Consider again the EUREKA 's plurals. The part-name UDH is singular when it appears without the 's and thus the morphological tag +Sg is appended to it. In the STANDARD grammar, the tag +Sg has a lexical entry as in (18a) which states that +Sg is of category NNUM and assigns sg to its NUM. However, if this is used in the EUREKA grammar, the sg NUM specification will clash with the pl NUM specification when UDH appears with 's, as seen in (7). Thus, a new entry for +Sg is needed which has sg as a default value, as in (18b). The first line of (18b) states that NUM must exist but does not specify a value, while the second line optionally supplies a sg value to NUM; when the 's is used, this option does not apply since the form already has a pl NUM value.</Paragraph>
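The default behavior of the EUREKA +Sg entry in (18b) can be sketched as a feature-structure update (the helper below is invented; real XLE expresses this with functional annotations, roughly an existential constraint on NUM plus an optional equation supplying sg):

```python
def apply_sg_tag(fstruct):
    """Apply the EUREKA +Sg entry: NUM must exist, but sg is only a default.

    If NUM is already set (e.g. pl from the 's plural), keep it; otherwise
    supply the default sg value.
    """
    out = dict(fstruct)
    out.setdefault("NUM", "sg")  # the optional (NUM)=sg disjunct
    return out
```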
      <Paragraph position="13"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Tokenizing and Morphological Analysis
</SectionTitle>
    <Paragraph position="0"> Tokenization and morphological analysis in XLE are carried out by means of finite state transductions.</Paragraph>
    <Paragraph position="1"> The STANDARD tokenizing transducer encodes the punctuation conventions of normal English text, which is adequate for many applications. However, the Eureka and WSJ corpora include strings that must be tokenized in non-standard ways. The Eureka part identifiers have internal punctuation that would normally cause a string to be broken up (e.g. the hyphen in PL1-B7), and the WSJ corpus is marked up with labeled brackets and part-of-speech tags that must also receive special treatment. An example of the WSJ mark-up is seen in (19).</Paragraph>
    <Paragraph position="2"> (19) [NP-SBJ Lloyd's, once a pillar of the world insurance market,] is/VBZ being/VBG shaken/VBN to its very foundation.</Paragraph>
    <Paragraph position="3"> Part-of-speech tags appear in a distinctive format: they begin with a / and run to the end of the token, with the material after the slash indicating the content of the tag (VBZ for a finite 3rd-singular verb, VBG for a progressive, VBN for a passive, etc.). The tokenizing transducer must recognize this pattern and split the tags off as separate tokens. The tag-tokens must be available to filter the output of the morphological analyzer so that only verbal forms are compatible with the tags in this example, and the adjectival reading of shaken is therefore blocked.</Paragraph>
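In XLE this splitting is done by a finite-state transducer; as an illustrative stand-in (handling only the tags, not the surrounding punctuation conventions), a regular expression achieves the same token split:

```python
import re

# A /-initial run of capitals at the end of a token is a POS tag.
TAG = re.compile(r"(/[A-Z]+)")

def tokenize_tagged(text):
    """Split WSJ part-of-speech tags off as separate tokens."""
    tokens = []
    for chunk in text.split():
        # re.split with a capturing group keeps the tag as its own piece.
        tokens.extend(p for p in TAG.split(chunk) if p)
    return tokens
```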
    <Paragraph position="4"> XLE tokenizing transducers are compiled from specifications expressed in the sophisticated Xerox finite state calculus (Beesley and Karttunen, 2002).</Paragraph>
    <Paragraph position="5"> The Xerox calculus includes the composition, ignore, and substitution operators discussed by Kaplan and Kay (1994) and the priority-union operator of Kaplan and Newman (1997). The specialized tokenizers are constructed by using these operators to combine the STANDARD specification with expressions that extend or restrict the standard behavior. For example, the ignore operator is applied to allow the part-of-speech information to be passed through to the morphology without interrupting the standard patterns of English punctuation.</Paragraph>
    <Paragraph position="6"> XLE also allows separately compiled transducers to be combined at run-time by the operations of priority-union, composition, and union. Priority-union was used to supplement the standard morphology with specialized &amp;quot;guessing&amp;quot; transducers that apply only to tokens that would otherwise be unrecognized. Thus, a finite-state guesser was added to identify Eureka fault numbers (09-425), adjustment numbers (12-23), part numbers (606K2100), part list numbers (PL1-B7), repair numbers (2.4), tag numbers (P-102), and diagnostic code numbers (dC131).</Paragraph>
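The priority-union semantics, i.e. that the guesser fires only where the standard analyzer fails, can be modeled as a fallback lookup. The patterns below are loose, invented approximations of some of the Eureka identifiers listed in the text, and the analysis strings are illustrative only:

```python
import re

# Guessers for a few of the Eureka identifier classes (approximate patterns).
GUESSES = [
    (re.compile(r"^PL\d+-[A-Z]\d+$"), "+PartListNum"),   # e.g. PL1-B7
    (re.compile(r"^\d{3}[A-Z]\d{4}$"), "+PartNum"),      # e.g. 606K2100
    (re.compile(r"^dC\d+$"), "+DiagCode"),               # e.g. dC131
    (re.compile(r"^\d+-\d+$"), "+FaultOrAdjNum"),        # e.g. 09-425, 12-23
]

def analyze(token, standard):
    """Priority union: prefer the standard analysis, guess only otherwise."""
    if token in standard:
        return standard[token]
    for pattern, tag in GUESSES:
        if pattern.match(token):
            return token + tag
    return None
```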
    <Paragraph position="7"> Composition was used to apply the part-of-speech filtering transducer to the output of the morphological analyzer, and union provided an easy way of adding new, corpus-specific terminology.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Optimality Marks
</SectionTitle>
    <Paragraph position="0"> XLE supports a version of Optimality Theory (OT) (Prince and Smolensky, 1993) which is used to rank an analysis relative to other possible analyses (Frank et al., 2001). In general, this is used within a specific grammar to prefer or disprefer a construction. However, it can also be used in grammar extensions to delete or include rules or parts of rules.</Paragraph>
    <Paragraph position="1"> The XLE implementation of OT works as follows.1 OT marks are placed in the grammar and are associated with particular rules, parts of rules, or lexical entries. These marks are then ranked in the grammar CONFIGURATION. (Footnote 1: The XLE implementation of OT is more complex than this, allowing for UNGRAMMATICAL and STOPPOINT marks as well. Only OT marks that are associated with NOGOOD are of interest here. For a full description, see (Frank et al., 2001).)</Paragraph>
    <Paragraph position="2"> In addition to a simple ranking of constraints, which states that a construction with one mark is preferred or dispreferred relative to a construction without it, XLE allows the marks to be specified as NOGOOD. A rule or rule disjunct which has a NOGOOD OT mark associated with it will be ignored by XLE. This can be used for grammar extensions in that it allows a standard grammar to anticipate the variations required by special corpora without using them in normal circumstances.</Paragraph>
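The effect of NOGOOD can be modeled as filtering rule disjuncts by their OT marks. The representation is invented for illustration: a rule is a list of (rhs, mark) pairs, where mark is None for unmarked disjuncts.

```python
def active_disjuncts(rule, nogood_marks):
    """Drop disjuncts whose OT mark is ranked NOGOOD in the CONFIGURATION."""
    return [rhs for rhs, mark in rule if mark not in nogood_marks]
```

With the N rule of (20), ranking EUR-PLURAL as NOGOOD (the STANDARD configuration) removes the 's-plural disjunct, while leaving the mark unranked (the EUREKA configuration) keeps it.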
    <Paragraph position="3"> Consider the example of the EUREKA 's plurals discussed in section 2.1. Instead of rewriting the N rule in the EUREKA grammar, it would be possible to modify it in the STANDARD grammar and include an OT mark, as in (20).</Paragraph>
    <Paragraph position="4"> (20) N original STANDARD N rules (PL: @(OT-MARK EUR-PLURAL)).</Paragraph>
    <Paragraph position="5"> The CONFIGURATION files of the STANDARD and EUREKA grammars would differ in that the STANDARD grammar would rank the EUR-PLURAL OT mark as NOGOOD, as in (21a), while the EUREKA grammar would simply not rank the mark, as in (21b).</Paragraph>
    <Paragraph position="6"> (21) a. STANDARD optimality order:</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
EUR-PLURAL NOGOOD
</SectionTitle>
    <Paragraph position="0"> b. EUREKA optimality order:</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
NOGOOD
</SectionTitle>
    <Paragraph position="0"> Given the OT marks, it would be possible to have one large grammar that is specialized by different OT rankings to produce the STANDARD, EUREKA, and WSJ variants. However, from a grammar writing perspective this is not a desirable solution because it becomes difficult to keep track of which constructions belong to standard English and are shared among all the specializations and which are corpus-specific. In addition, it does not distinguish a core set of slowly changing linguistic specifications for the basic patterns of the language, and thus does not provide a stable foundation that the writers of more specialized grammars can rely on.</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Maintenance with Grammar Extensions
</SectionTitle>
    <Paragraph position="0"> Maintenance is a serious issue for any large-scale grammar development activity, and the maintenance problems are compounded when multiple versions are being created, perhaps by several different grammar writers. Our STANDARD grammar is now quite mature and covers all the linguistically significant constructions and most other constructions that we have encountered in previous corpus analysis. However, every now and then a new corpus, even a specialized one, will evidence a standard construction that has not previously been accounted for. If specialized grammars were written by copying all the STANDARD files and then modifying them, the implementation of new standard constructions would tend to appear only in the specialized grammar. Our techniques for minimizing the amount of copying encourage us to implement new constructions in the STANDARD grammar, which makes them available to all other specializations.</Paragraph>
    <Paragraph position="1"> If a new version of a rule for a specialized grammar is created by copying the corresponding STANDARD rule, changes later made to the special rule will not automatically be reflected in the STANDARD grammar, and vice versa. This is the desired behavior when adding unusual, corpus-specific constructions. However, if the non-corpus specific parts of the new rule are modified, these modifications will not migrate to the STANDARD grammar. To avoid this problem, the smallest rule possible should be modified in the specialized grammar, e.g., modifying the N head rule instead of the entire NP. For this reason, having highly modularized rules and using macros and templates helps in grammar maintenance both within a grammar and across specialized grammar extensions.</Paragraph>
    <Paragraph position="2"> As seen above, the XLE grammar development platform provides a number of mechanisms to allow for grammar extensions without altering the core (STANDARD) grammar. However, there are still areas that could use improvement. For example, as mentioned in section 2, the CONFIGURATION file states which other files the grammar includes and how they are prioritized. The CONFIGURATION contains other information such as declarations of the governable grammatical functions, the distributive features, etc. As this information rarely changes with grammar extensions, it would be helpful for an extension configuration to incorporate by reference such additional parameters of the STANDARD configuration. Currently these declarations must be copied into each CONFIGURATION.</Paragraph>
  </Section>
</Paper>