<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1506"> <Title>Adapting Existing Grammars: The XLE Experience</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Priority-based Grammar Specialization </SectionTitle> <Paragraph position="0"> The XLE system is designed so that the grammar writer can build specialized grammars by both extending and restricting another grammar (in our case the base grammar is the STANDARD Pargram English grammar). An LFG grammar is presented to the XLE system in a priority-ordered sequence of files containing phrase-structure rules, lexical entries, abbreviatory macros and templates, feature declarations, and finite-state transducers for tokenization and morphological analysis. XLE is applied to a single root file holding a CONFIGURATION that identifies all the other files containing relevant linguistic specifications, that indicates how those components are to be assembled into a complete grammar, and that specifies certain parameters that control how that grammar is to be interpreted.</Paragraph> <Paragraph position="1"> A key idea is that there can be only one definition of an item of a given type with a particular name (e.g., there can be only one NP rule although that single rule can have many alternative expansions), and items in a higher-priority file override lower-priority items of the same type with the same name. This setup is similar to the priority-override scheme of the earlier LFG Grammar Writer's Workbench (Kaplan and Maxwell, 1996).</Paragraph> <Paragraph position="2"> This arrangement makes it relatively easy to construct a specialized grammar from a pre-existing standard. The specialized grammar is defined by a CONFIGURATION in its own root file that specifies the relevant STANDARD grammar files as well as the new files for the specialized grammar.
The files for the specialized grammar can also contain items of different types (phrase-structure rules, lexical entries, templates, etc.), and they are ordered with higher priority than the STANDARD files.</Paragraph> <Paragraph position="3"> Consider the configuration for the EUREKA grammar. It specifies all of the STANDARD grammar files as well as its own rule, template, lexicon, and morphology files. A part of this configuration is shown in (3) (the notationtemplates.lfg file is shared by all the languages' grammars, not just English).</Paragraph> <Paragraph position="4"> This configuration specifies that the EUREKA rules, templates, and lexical entries are given priority over the STANDARD items by putting the special EUREKA files at the end of the list. Thus, if the ../standard/english-rules.lfg and eureka-rules.lfg files both contain a rule expanding the NP category, the one from the STANDARD file will be discarded in favor of the EUREKA rule.</Paragraph> <Paragraph position="5"> In the following subsections, we provide several illustrations of how simple overriding has been used for the EUREKA and WSJ grammar extensions.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Rules </SectionTitle> <Paragraph position="0"> The override convention makes it possible to: add rules (e.g., for new or idiosyncratic constructions); delete rules (e.g., to block constructions not found in the new corpus); and modify rules to allow different daughter sequences.</Paragraph> <Paragraph position="1"> Rules may need to be added to allow for corpus-specific constructions. This is illustrated in the EUREKA corpus by the identifier information that precedes each sentence, as in (1). In order to parse this substring, a new category (FIELD) was defined with an expansion that covers the identifier information followed by the usual ROOT category of the STANDARD grammar.
The top-level category is one of the parameters of a configuration, and the EUREKA CONFIGURATION specifies that FIELD instead of the STANDARD ROOT is the start-symbol of the grammar. Thus the EUREKA grammar produces the tree in (4) and functional-structure in (5) for (1a).</Paragraph> <Paragraph position="2"> It is unusual in practice to need to delete a rule, i.e., to eliminate completely the possibility of expanding a given category of the STANDARD grammar. This is generally only motivated when the specialized grammar applies to a domain where certain constructions are rarely encountered, if at all. Although there has been no need to delete rules for the EUREKA and WSJ corpora, the override convention also provides a natural way of achieving this effect.</Paragraph> <Paragraph position="3"> For example, topicalization is extremely rare in the Eureka corpus and the STANDARD topicalization rule sometimes introduces parsing inefficiency. This can be avoided by having the high-priority EUREKA file replace the STANDARD rule with the one in (6).</Paragraph> <Paragraph position="4"> (6) CPtop → .</Paragraph> <Paragraph position="5"> This vacuous rule expands the CPtop category to the empty language, the language containing no strings; so, this category is effectively removed from the grammar.</Paragraph> <Paragraph position="6"> Perhaps the most common change is to make modifications to the behavior of existing rules. The most direct way of doing this is simply to define a new, higher-priority expansion of the same left-hand category. Since XLE only allows a single rule for a given category, the old rule is discarded and the new one comes into play. The new rule can be arbitrarily different from the STANDARD one, but this is not typically the case. It is much more common that the specialized version incorporates most of the behavior of the original, with minor extensions or restrictions.
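The override convention used for adding, deleting, and modifying rules can be pictured with a small sketch. The Python fragment below is only an illustration under stated assumptions, not XLE code: the function name assemble, the dictionary keyed by (type, name), and the placeholder definition strings are all hypothetical; only the two file names and the category names come from the text.

```python
# Illustrative model of priority-based overriding (an assumption-laden
# sketch, not XLE's actual implementation or file format).

def assemble(files):
    """Merge items from files listed in increasing priority order.

    Each file is a (filename, items) pair; items maps an
    (item_type, name) key to a placeholder definition string.
    A later file replaces any earlier item with the same key,
    so only one definition per type and name survives.
    """
    grammar = {}
    for filename, items in files:
        for key, definition in items.items():
            grammar[key] = (definition, filename)
    return grammar


# Placeholder contents; the real files hold LFG rules, not strings.
standard = ("../standard/english-rules.lfg", {
    ("rule", "ROOT"): "STANDARD ROOT expansion",
    ("rule", "NP"): "STANDARD NP expansion",
    ("rule", "CPtop"): "STANDARD topicalization expansion",
})
eureka = ("eureka-rules.lfg", {
    ("rule", "NP"): "EUREKA NP expansion",            # modified rule
    ("rule", "CPtop"): "vacuous (empty language)",    # effectively deleted
    ("rule", "FIELD"): "identifier followed by ROOT", # added rule
})

# The EUREKA file is listed last, which gives it the highest priority.
grammar = assemble([standard, eureka])

for (item_type, name), (definition, source) in sorted(grammar.items()):
    print(item_type, name, "from", source, ":", definition)
```

Listing the EUREKA file last mirrors the file order of the configuration: the whole STANDARD definition of a category is replaced, never merged, which is why a specialized rule has to restate whatever STANDARD behavior it wants to keep.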
One way of producing the modified behavior is to create a new rule that includes a copy of some or all of the STANDARD rule's right side along with new material, and to give the new definition higher priority than the old. For example, plurals in the Eureka corpus can be formed by the addition of 's instead of the usual s, as in (7).</Paragraph> <Paragraph position="7"> (7) (CAUSE 27416 10) A 7mfd inverter motor capacitor was installed on an unknown number of UDH's.</Paragraph> <Paragraph position="8"> In order to allow for this, the N rule was rewritten to allow a PL marker to optionally occur after any N, as in (8).</Paragraph> <Paragraph position="9"> (8) N → copy of STANDARD N rule (PL) As a result of this rule modification, UDH's in (7) will have the tree and functional-structure in (9). (9) [tree and functional-structure for UDH's not preserved] Copying material from one version to another is perhaps reasonable for relatively stable and simple rules, like the N rule, but this can cause maintainability problems with complicated rules in the STANDARD grammar that are updated frequently. An alternative strategy is to move the body of the STANDARD N rule to a different rule, e.g., Nbody, which in turn is called by the N rule in both the STANDARD and EUREKA grammars. The Nbody category can be suppressed in the tree structure by invoking this rule as a macro (notationally indicated as @Nbody).</Paragraph> <Paragraph position="10"> (10) N → @Nbody (PL).</Paragraph> <Paragraph position="11"> Often the necessary modification can be made simply by redefining a macro that existing rules already invoke. Consider the ROOT rule, in (11). (11) ROOT → @DECL-BODY @DECL-PUNCT</Paragraph> </Section> </Section> </Paper>