<?xml version="1.0" standalone="yes"?>
<Paper uid="J89-4003">
  <Title>A FORMAL MODEL FOR CONTEXT-FREE LANGUAGES AUGMENTED WITH REDUPLICATION</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Context-free grammars are a recurrent theme in many, perhaps most, models of natural language syntax. It is close to impossible to find a textbook or journal that deals with natural language syntax but that does not include parse trees someplace in its pages. The models used typically augment the context-free grammar with some additional computational power so that the class of languages described is invariably larger, and often much larger, than the class of context-free languages.</Paragraph>
    <Paragraph position="1"> Despite the tendency to use more powerful models, examples of natural language constructions that require more than a context-free grammar for weak generative adequacy are rare. (See Pullum and Gazdar 1982, and Gazdar and Pullum 1985, for surveys of how rare inherently noncontext-free constructions appear to be.) Moreover, most of the examples of inherently noncontext-free constructions in natural languages depend on a single phenomenon, namely the reduplication, or approximate reduplication, of some string. Reduplication is provably beyond the reach of context-free grammars.</Paragraph>
    <Paragraph position="2"> The goal of this paper is to present a model that can accommodate these reduplication constructions with a minimal extension to the context-free grammar model.</Paragraph>
    <Paragraph position="3"> Reduplication in its cleanest, and most sterile, form is represented by the formal language {ww | w ∈ Σ*}, where Σ is some finite alphabet. It is well known that this language is provably not context-free. Yet there are numerous constructs in natural language that mimic this formal language. Indeed, most of the known, convincing arguments that some natural language cannot (or almost cannot) be weakly generated by a context-free grammar depend on a reduplication similar to the one exhibited by this formal language. Examples include the respectively construct in English (Bar-Hillel and Shamir 1964), noun-stem reduplication and incorporation in Mohawk (Postal 1964), noun reduplication in Bambara (Culy 1985), cross-serial dependency of verbs and objects in certain subordinate clause constructions in Dutch (Huybregts 1976; Bresnan et al. 1982) and in Swiss-German (Shieber 1985), and various reduplication constructs in English, including the X or no X construction as in: &amp;quot;reduplication or no reduplication, I want a parse tree&amp;quot; (Manaster-Ramer 1983, 1986). The model presented here can generate languages with any of these constructions and can do so in a natural way.</Paragraph>
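Though provably beyond context-free grammars, the copy language is trivial to decide procedurally, which makes it a convenient running example. A minimal Python sketch of the membership test (the function name is illustrative, not from the paper):

```python
def is_reduplication(s: str) -> bool:
    """Membership test for the copy language {ww | w in Sigma*}:
    the string must split into two identical halves."""
    n = len(s)
    if n % 2 != 0:
        return False
    return s[:n // 2] == s[n // 2:]

# Easy to decide, yet provably not context-free: a pushdown stack
# pops in reverse (LIFO) order, so it can match w against its
# mirror image, but not against a second left-to-right copy of w.
assert is_reduplication("abab")        # w = "ab"
assert not is_reduplication("abba")    # a mirror image, not a copy
```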
    <Paragraph position="4"> To have some concrete examples at hand, we will review a representative sample of these constructions.</Paragraph>
    <Paragraph position="5"> The easiest example to describe is the noun reduplication found in Bambara. As described by Culy (1985), it is an example of the simplest sort of reduplication in which a string literally occurs twice. From the noun w Bambara can form w-o-w with the meaning &amp;quot;whatever w.&amp;quot; It is also possible to combine nouns productively in other ways to obtain new, longer nouns. Using these longer nouns in the w-o-w construction produces reduplicated strings of arbitrary length.</Paragraph>
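The unboundedness argument can be sketched in a few lines of Python. The noun strings and the flat model of compounding below are illustrative assumptions, not Culy's actual Bambara data:

```python
def whatever_construction(noun: str) -> str:
    """Culy's (1985) Bambara construction: from a noun w, form
    w-o-w, meaning 'whatever w'.  Hyphens mark morpheme boundaries
    for readability only."""
    return f"{noun}-o-{noun}"

def compound(*nouns: str) -> str:
    """Stand-in for Bambara's productive noun compounding: the
    only point modeled is that nouns combine into longer nouns,
    so the reduplicated halves can grow without bound."""
    return "".join(nouns)

# Hypothetical noun strings, for illustration only:
long_noun = compound("malo", "nyinina")
assert whatever_construction(long_noun) == "malonyinina-o-malonyinina"
```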
    <Paragraph position="6"> The respectively construction in English is one of the oldest well-known examples of reduplication (Bar-Hillel and Shamir 1964). It provides an example of reduplication other than exact identity. A sample sentence is: John, Sally, Mary, and Frank are a widower, widow, widow, and widower, respectively.</Paragraph>
    Copyright 1989 by the Association for Computational Linguistics. Computational Linguistics, Volume 15, Number 4, December 1989.
    <Paragraph position="7"> In these cases, it has been argued that the names must agree with &amp;quot;widow&amp;quot; or &amp;quot;widower&amp;quot; in gender, and hence the string from {widow, widower}* must be an approximate reduplication of the string of names. If one accepts the data, then this is an example of reduplication using a notion of equivalence other than exact identity. In this case the equivalence would be that the second string is a homomorphic image of the first one.</Paragraph>
    <Paragraph position="8"> However, one must reject the data in this case. One can convincingly argue that the names need not agree with the corresponding occurrence of &amp;quot;widow&amp;quot; or &amp;quot;widower,&amp;quot; because gender is not syntactic in this case. It may be false, but it is not ungrammatical to say &amp;quot;John is a widow.&amp;quot; (Perhaps it is not even false, since no syntactic rule prevents parents from naming a daughter &amp;quot;John.&amp;quot;) However, such dependency is at least potentially possible in some language with truly syntactic gender markings. Kac et al. (1987) discuss a version of this construction using subject-verb number agreement that yields a convincing argument that English is not a context-free language.</Paragraph>
    <Paragraph position="9"> One of the least controversial arguments claiming to prove that a particular natural language cannot be weakly generated by a context-free grammar is Shieber's (1985) argument about Swiss-German. In this case, the reduplication occurs in certain subordinate clauses such as the following: ... mer em Hans es huus hälfed aastriiche ... we Hans-DAT the house-ACC helped paint '... we helped Hans paint the house.' where to obtain a complete sentence, the above should be preceded by some string such as &amp;quot;Jan säit das&amp;quot; ('Jan says that'). In this case, a list of nouns precedes a list of an equal number of verbs and each noun in the list serves as the object of the corresponding verb. The cross-serial dependency that pushes the language beyond the reach of a context-free grammar is an agreement rule that says that each verb arbitrarily demands either accusative or dative case for its object. Thus if we substitute &amp;quot;de Hans&amp;quot; (Hans-ACC) for &amp;quot;em Hans&amp;quot; (Hans-DAT) or &amp;quot;em huus&amp;quot; (the house-DAT) for &amp;quot;es huus&amp;quot; (the house-ACC), then the above is ungrammatical because &amp;quot;hälfed&amp;quot; demands that its object be in the dative case and &amp;quot;aastriiche&amp;quot; requires the accusative case. Since the lists of nouns and verbs may be of unbounded length, this means that Swiss-German contains substrings of the form N1 N2 ... Nn V1 V2 ... Vn, where n may be arbitrarily large and where each noun Ni is in either the dative or accusative case depending on an arbitrary requirement of the verb Vi.</Paragraph>
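The agreement pattern amounts to a cross-serial check. The two-verb lexicon below is a toy assumption built only from Shieber's example sentence, not a description of Swiss-German:

```python
# Toy lexicon (illustrative): each verb arbitrarily demands a case
# for its object, as in Shieber's (1985) example.
CASE_DEMANDED = {"hälfed": "DAT", "aastriiche": "ACC"}

def cross_serial_ok(nouns, verbs):
    """nouns: list of (noun, case) pairs; verbs: list of verbs.
    The fragment N1..Nn V1..Vn is well formed only if the lists
    have equal length and noun i bears exactly the case demanded
    by the verb in the same serial position i."""
    return (len(nouns) == len(verbs) and
            all(case == CASE_DEMANDED[verb]
                for (_, case), verb in zip(nouns, verbs)))

# "em Hans" (DAT) ... "es huus" (ACC) with hälfed, aastriiche:
assert cross_serial_ok([("Hans", "DAT"), ("huus", "ACC")],
                       ["hälfed", "aastriiche"])
# Substituting "de Hans" (ACC) violates hälfed's dative demand:
assert not cross_serial_ok([("Hans", "ACC"), ("huus", "ACC")],
                           ["hälfed", "aastriiche"])
```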
    <Paragraph position="10"> Bresnan et al. (1982) describe a similar construction in Dutch in which the strong agreement rule is not present and so the language (at least this aspect of it) can be weakly generated by a context-free grammar, even though it cannot be strongly generated by a context-free grammar. The context-free grammar to generate the strings would pair nouns and verbs in a mirror image manner, thereby ensuring that there are equal numbers of each. Since Dutch does not have the strong agreement rule that Swiss-German does, this always produces a grammatical clause, even though the pairing of nouns and verbs is contrary to intuition.</Paragraph>
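The mirror-image pairing corresponds to the context-free rule S → N S V | ε. A tiny Python rendering of that derivation (with nouns and verbs abstracted to the symbols 'N' and 'V'):

```python
def derive(n: int) -> str:
    """Expand the context-free rule S -> 'N' S 'V' | '' a total of
    n times, yielding N^n V^n.  The grammar pairs the i-th N with
    the (n+1-i)-th V -- nested, mirror-image -- which suffices for
    weak generation of the Dutch clause shape even though the
    intuitive dependency pairs the i-th noun with the i-th verb."""
    return "" if n == 0 else "N" + derive(n - 1) + "V"

assert derive(3) == "NNNVVV"
```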
    <Paragraph position="11"> However, in cases such as this, it would be desirable to have a model that recognizes reduplication as reduplication rather than one that must resort to some sort of trick to generate weakly the reduplicated strings in a highly unnatural manner. This is true even if one is seeking only weak generative capacity because, as the Dutch/Swiss-German pair indicates, if a minor and plausible addition to a construct in one natural language would make it demonstrably noncontext-free, then we can suspect that some other language may exhibit this or a similar inherently noncontext-free property, even when considered purely as a string set.</Paragraph>
    <Paragraph position="12"> Some of these arguments are widely accepted. Others are often disputed. We will not pass judgment here except to note that, whether or not the details of the data are sharp enough to support a rigorous proof of noncontext-freeness, it is nonetheless clearly true that, in all these cases, something like reduplication is occurring. A model that could economically capture these constructions as well as any reasonable variant on these examples would go a long way toward the goal of precisely describing the class of string sets that correspond to actual and potential human languages.</Paragraph>
    <Paragraph position="13"> We do not contend that the model to be presented here will weakly describe all natural languages without even the smallest exception. Any such claim for any model, short of ridiculously powerful models, is doomed to failure. Human beings taken in their entirety are general-purpose computers capable of performing any task that a Turing machine or other general-purpose computer model can perform, and so humans can potentially recognize any language describable by any algorithmic process whatsoever (although sometimes too slowly to be of any practical concern). The human language facility appears to be restricted to a much less powerful computational mechanism. However, since the additional power is there for purposes other than language processing, some of this power inevitably will find its way into language processing in some small measure. Indeed, we discuss one Dutch construction that our model cannot handle. We claim that our model captures most of the known constructions that make natural language provably not context-free as string sets, and that it does so with a small addition to the context-free grammar model. No more grandiose claims are made.</Paragraph>
    <Paragraph position="14"> It is easy to add power to a model, and there are numerous models that can weakly generate languages representing all of these noncontext-free constructions.</Paragraph>
    <Paragraph position="15"> Computational Linguistics, Volume 15, Number 4, December 1989 251 Walter J. Savitch A Formal Modeli for Context-Free Languages Augmented with Reduplication However, they all appear to be much too powerful for the simple problems that extend natural language beyond the capacity of context-free grammar. One of the less powerful of the well-known models is indexed grammar, as introduced by Aho (1968) and more recently summarized in the context of natural language by Gazdar (1985). However, even the indexed languages appear to be much more powerful than is needed for natural language syntax. We present a model that is weaker than the indexed grammar model, simpler than the indexed grammar model, and yet capable of handling all context-free constructs plus reduplication.</Paragraph>
    <Paragraph position="16"> A number of other models extend the context-free grammar model in a limited way. Four models that are known to be weakly equivalent and to be strictly weaker than indexed grammars are: the Tree Adjoining Grammars (TAGs) of Joshi (1985, 1987), the Head Grammars of Pollard (1984), the Linear Indexed Grammars of Gazdar (1985), and the Combinatory Categorial Grammars of Steedman (1987, 1988). For a discussion of this equivalence see Joshi et al. (1989). The oldest of these four models is the TAG grammar of Joshi, and we shall refer to the class of languages generated by any of these equivalent grammar formalisms as TAG languages.</Paragraph>
    <Paragraph position="17"> However, the reader should keep in mind that this class of languages could be represented by any of the four equivalent grammar formalisms. As we will see later in this paper, there are TAG languages that cannot be weakly generated by our model. Our model seems to exclude more unnatural string sets than these models do. Of course, our model may also miss some natural string sets that are TAG languages. Recent work of Joshi (1989) appears to support our conjecture that the class of languages described by our model is a subset, and hence a strict subset, of the TAG languages.</Paragraph>
    <Paragraph position="18"> However, all the details of the proof have not yet been worked out, and so any more detailed comparisons to TAG languages will be left for another paper.</Paragraph>
    <Paragraph position="19"> This paper assumes some familiarity with the notation and results of formal language theory. Any reader who has worked with context-free grammars, who knows what a pushdown automaton (PDA) is, and who knows what it means to say that PDAs accept exactly the context-free languages should have sufficient background to read this paper. Any needed background can be found in almost any text on formal language theory, such as Harrison (1978) or Hopcroft and Ullman (1979).</Paragraph>
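As a reminder of the assumed background, the following sketch simulates a nondeterministic PDA for the mirror language {w wᴿ | w ∈ Σ*}, which is context-free; contrast the copy language {ww}, which no PDA accepts. Resolving the nondeterminism by trying every midpoint is an illustrative simulation strategy, not part of the PDA model itself:

```python
def pda_accepts_mirror(s: str) -> bool:
    """Simulate a one-stack nondeterministic PDA for the even
    palindromes {w w^R | w in Sigma*}: push input symbols,
    nondeterministically guess the midpoint, then pop one symbol
    per remaining input symbol, requiring each pair to match.
    Trying every midpoint is equivalent, since the PDA's only
    choice point is when to stop pushing."""
    n = len(s)
    for mid in range(n + 1):
        stack = list(s[:mid])          # push phase
        ok = True
        for c in s[mid:]:              # pop-and-match phase
            if not stack or stack.pop() != c:
                ok = False
                break
        if ok and not stack:           # accept with empty stack
            return True
    return False

assert pda_accepts_mirror("abba")      # w = "ab": stack reverses it
assert not pda_accepts_mirror("abab")  # a copy, not a mirror image
```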
  </Section>
</Paper>