<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2917">
  <Title>Learning Auxiliary Fronting with Grammatical Inference</Title>
  <Section position="4" start_page="0" end_page="125" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> For some years, a particular set of examples has been used to provide support for nativist theories of first language acquisition (FLA). These examples, which hinge around auxiliary inversion in the formation of questions in English, have been considered to provide a strong argument in favour of the nativist claim: that FLA proceeds primarily through innately specified domain specific mechanisms or knowledge, rather than through the operation of general-purpose cognitive mechanisms. A key point of empirical debate is the frequency of occurrence of the forms in question. If these are vanishingly rare, or non-existent in the primary linguistic data, and yet children acquire the construction in question, then the hypothesis that they have innate knowledge would be supported. But this rests on the assumption that examples of that specific construction are necessary for learning to proceed. In this paper we show that this assumption is false: that this particular construction can be learned without the learner being exposed to any examples of that particular type. Our demonstration is primarily mathematical/computational: we present a simple experiment that demonstrates the applicability of this approach to this particular problem neatly, but the data we use is not intended to be a realistic representation of the primary linguistic data, nor is the particular algorithm we use suitable for large scale grammar induction.</Paragraph>
    <Paragraph position="1"> We present a general purpose context-free grammatical algorithm that is provably correct under a certain learning criterion. This algorithm incorporates no domain specific knowledge: it has no specific information about language; no knowledge of X-bar schemas, no hidden sources of information to reveal the structure. It operates purely on unannotated strings of raw text. Obviously, as all learning algorithms do, it has an implicit learning bias. This very simple algorithm has a particularly clear bias, with a simple mathematical description, that allows a remarkably simple characterisation of the set of languages that it can learn. This algorithm does not use a statistical learning paradigm that has to be tested on large quantities of data. Rather it uses a  symbolic learning paradigm, that works efficiently with very small quantities of data, while being very sensitive to noise. We discuss this choice in some depth below.</Paragraph>
    <Paragraph position="2"> For reasons that were first pointed out by Chomsky (Chomsky, 1975, pages 129-137), algorithms of this type are not capable of learning all of natural language. It turns out, however, that algorithms based on this approach are sufficiently strong to learn some key properties of language, such as the correct rule for forming polar questions.</Paragraph>
    <Paragraph position="3"> In the next section we shall describe the dispute briefly; in the subsequent sections we will describe the algorithm we use, and the experiments we have performed.</Paragraph>
  </Section>
  <Section position="5" start_page="125" end_page="125" type="metho">
    <SectionTitle>
2 The Dispute
</SectionTitle>
    <Paragraph position="0"> We will present the dispute in traditional terms, though later we shall analyse some of the assumptions implicit in this description. In English, polar interrogatives (yes/no questions) are formed by fronting an auxiliary, and adding a dummy auxiliary &amp;quot;do&amp;quot; if the main verb is not an auxiliary. For example, null Example 1a The man is hungry.</Paragraph>
    <Paragraph position="1"> Example 1b Is the man hungry? When the subject NP has a relative clause that also contains an auxiliary, the auxiliary that is moved is not the auxiliary in the relative clause, but the one in the main (matrix) clause.</Paragraph>
    <Paragraph position="2"> Example 2a The man who is eating is hungry.</Paragraph>
    <Paragraph position="3"> Example 2b Is the man who is eating hungry? An alternative rule would be to move the first occurring auxiliary, i.e. the one in the relative clause, which would produce the form Example 2c Is the man who eating is hungry? In some sense, there is no reason that children should favour the correct rule, rather than the incorrect one, since they are both of similar complexity and so on. Yet children do in fact, when provided with the appropriate context, produce sentences of the form of Example 2b, and rarely if ever produce errors of the form Example 2c (Crain and Nakayama, 1987). The problem is how to account for this phenomenon.</Paragraph>
    <Paragraph position="4"> Chomsky claimed first, that sentences of the type in Example 2b are vanishingly rare in the linguistic environment that children are exposed to, yet when tested they unfailingly produce the correct form rather than the incorrect Example 2c. This is put forward as strong evidence in favour of innately specified language specific knowledge: we shall refer to this view as linguistic nativism.</Paragraph>
    <Paragraph position="5"> In a special volume of the Linguistic Review, Pullum and Scholz (Pullum and Scholz, 2002), showed that in fact sentences of this type are not rare at all. Much discussion ensued on this empirical question and the consequences of this in the context of arguments for linguistic nativism. These debates revolved around both the methodology employed in the study, and also the consequences of such claims for nativist theories. It is fair to say that in spite of the strength of Pullum and Scholz's arguments, nativists remained completely unconvinced by the overall argument.</Paragraph>
    <Paragraph position="6"> (Reali and Christiansen, 2004) present a possible solution to this problem. They claim that local statistics, effectively n-grams, can be sufficient to indicate to the learner which alternative should be preferred. However this argument has been carefully rebutted by (Kam et al., 2005), who show that this argument relies purely on a phonological coincidence in English. This is unsurprising since it is implausible that a flat, finite-state model should be powerful enough to model a phenomenon that is clearly structure dependent in this way.</Paragraph>
    <Paragraph position="7"> In this paper we argue that the discussion about the rarity of sentences that exhibit this particular structure is irrelevant: we show that simple grammatical inference algorithms can learn this property even in the complete absence of sentences of this particular type. Thus the issue as to how frequently an infant child will see them is a moot point.</Paragraph>
  </Section>
  <Section position="6" start_page="125" end_page="129" type="metho">
    <SectionTitle>
3 Algorithm
</SectionTitle>
    <Paragraph position="0"> Context-free grammatical inference algorithms are explored in two different communities: in grammatical inference and in NLP. The task in NLP is normally taken to be one of recovering appropriate annotations (Smith and Eisner, 2005) that normally represent constituent structure (strong learning), while in grammatical inference, researchers  are more interested in merely identifying the language (weak learning). In both communities, the best performing algorithms that learn from raw positive data only 1, generally rely on some combination of three heuristics: frequency, information theoretic measures of constituency, and finally substitutability. 2 The first rests on the observation that strings of words generated by constituents are likely to occur more frequently than by chance. The second heuristic looks for information theoretic measures that may predict boundaries, such as drops in conditional entropy. The third method which is the foundation of the algorithm we use, is based on the distributional analysis of Harris (Harris, 1954). This principle has been appealed to by many researchers in the field of grammatical inference, but these appeals have normally been informal and heuristic (van Zaanen, 2000).</Paragraph>
    <Paragraph position="1"> In its crudest form we can define it as follows: given two sentences &amp;quot;I saw a cat over there&amp;quot;, and &amp;quot;I saw a dog over there&amp;quot; the learner will hypothesize that &amp;quot;cat&amp;quot; and &amp;quot;dog&amp;quot; are similar, since they appear in the same context &amp;quot;I saw a __ there&amp;quot;. Pairs of sentences of this form can be taken as evidence that two words, or strings of words are substitutable.</Paragraph>
    <Section position="1" start_page="126" end_page="126" type="sub_section">
      <SectionTitle>
3.1 Preliminaries
</SectionTitle>
      <Paragraph position="0"> We briefly define some notation.</Paragraph>
      <Paragraph position="1"> An alphabet S is a finite nonempty set of symbols called letters. A string w over S is a finite sequence w = a1a2 . . .an of letters. Let |w |denote the length of w. In the following, letters will be indicated by a, b, c, . . ., strings by u, v, . . ., z, and the empty string by l. Let S[?] be the set of all strings, the free monoid generated by S. By a language we mean any subset L [?] S[?]. The set of all substrings of a language L is denoted Sub(L) = {u [?] S+ : [?]l, r, lur [?] L} (notice that the empty word does not belong to Sub(L)). We shall assume an order [?] or precedesequal on S which we shall extend to S[?] in the normal way by saying that u [?] v if |u |&lt; |v |or |u |= |v| and u is lexicographically before v.</Paragraph>
      <Paragraph position="2"> A grammar is a quadruple G = &lt;V, S, P, S&gt; where S is a finite alphabet of terminal symbols, V  or attraction.</Paragraph>
      <Paragraph position="3"> is a finite alphabet of variables or non-terminals, P is a finite set of production rules, and S [?] V is a start symbol.</Paragraph>
      <Paragraph position="4"> If P [?] V x(S[?]V )+ then the grammar is said to be context-free (CF), and we will write the productions as T - w.</Paragraph>
      <Paragraph position="5"> We will write uTv = uwv when T - w [?] P.</Paragraph>
      <Paragraph position="6"> [?]= is the reflexive and transitive closure of =. In general, the definition of a class L relies on a class R of abstract machines, here called representations, together with a function L from representations to languages, that characterize all and only the languages of L: (1) [?]R [?] R,L(R) [?] L and (2) [?]L [?] L,[?]R [?] R such that L(R) = L.</Paragraph>
      <Paragraph position="7"> Two representations R1 and R2 are equivalent iff</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="2" start_page="126" end_page="127" type="sub_section">
      <SectionTitle>
3.2 Learning
</SectionTitle>
      <Paragraph position="0"> We now define our learning criterion. This is identification in the limit from positive text (Gold, 1967), with polynomial bounds on data and computation, but not on errors of prediction (de la Higuera, 1997).</Paragraph>
      <Paragraph position="1"> A learning algorithm A for a class of representations R, is an algorithm that computes a function from a finite sequence of strings s1, . . ., sn to R. We define a presentation of a language L to be an infinite sequence of elements of L such that every element of L occurs at least once. Given a presentation, we can consider the sequence of hypotheses that the algorithm produces, writing Rn = A(s1, . . .sn) for the nth such hypothesis.</Paragraph>
      <Paragraph position="2"> The algorithm A is said to identify the class R in the limit if for every R [?] R, for every presentation of L(R), there is an N such that for all n &gt; N, Rn = RN and L(R) = L(RN).</Paragraph>
      <Paragraph position="3"> We further require that the algorithm needs only polynomially bounded amounts of data and computation. We use the slightly weaker notion defined by de la Higuera (de la Higuera, 1997).</Paragraph>
      <Paragraph position="4"> Definition A representation class R is identifiable in the limit from positive data with polynomial time and data iff there exist two polynomials p(), q() and  an algorithm A such that S [?] L(R) 1. Given a positive sample S of size m A returns a representation R [?] R in time p(m), such that 2. For each representation R of size n there exists  a characteristic set CS of size less than q(n) such that if CS [?] S, A returns a representation Rprime such that L(R) = L(Rprime).</Paragraph>
    </Section>
    <Section position="3" start_page="127" end_page="128" type="sub_section">
      <SectionTitle>
3.3 Distributional learning
</SectionTitle>
      <Paragraph position="0"> The key to the Harris approach for learning a language L, is to look at pairs of strings u and v and to see whether they occur in the same contexts; that is to say, to look for pairs of strings of the form lur and lvr that are both in L. This can be taken as evidence that there is a non-terminal symbol that generates both strings. In the informal descriptions of this that appear in Harris's work, there is an ambiguity between two ideas. The first is that they should appear in all the same contexts; and the second is that they should appear in some of the same contexts. We can write the first criterion as follows: [?]l, r lur [?] L if and only if lvr [?] L (1) This has also been known in language theory by the name syntactic congruence, and can be written u [?]L v.</Paragraph>
      <Paragraph position="1"> The second, weaker, criterion is [?]l, r lur [?] L and lvr [?] L (2) We call this weak substitutability and write it as u .=L v. Clearly u [?]L v implies u .=L v when u is a substring of the language. Any two strings that do not occur as substrings of the language are obviously syntactically congruent but not weakly substitutable. First of all, observe that syntactic congruence is a purely language theoretic notion that makes no reference to the grammatical representation of the language, but only to the set of strings that occur in it. However there is an obvious problem: syntactic congruence tells us something very useful about the language, but all we can observe is weak substitutability. null When working within a Gold-style identification in the limit (IIL) paradigm, we cannot rely on statistical properties of the input sample, since they will in general not be generated by random draws from a fixed distribution. This, as is well known, severely limits the class of languages that can be learned under this paradigm. However, the comparative simplicity of the IIL paradigm in the form when there are polynomial constraints on size of characteristic sets and computation(de la Higuera, 1997) makes it a suitable starting point for analysis.</Paragraph>
      <Paragraph position="2"> Given these restrictions, one solution to this problem is simply to define a class of languages where substitutability implies congruence. We call these the substitutable languages: A language L is substitutable if and only if for every pair of strings u, v, u .=L v implies u [?]L v. This rather radical solution clearly rules out the syntax of natural languages, at least if we consider them as strings of raw words, rather than as strings of lexical or syntactic categories. Lexical ambiguity alone violates this requirement: consider the sentences &amp;quot;The rose died&amp;quot;, &amp;quot;The cat died&amp;quot; and &amp;quot;The cat rose from its basket&amp;quot;. A more serious problem is pairs of sentences like &amp;quot;John is hungry&amp;quot; and &amp;quot;John is running&amp;quot;, where it is not ambiguity in the syntactic category of the word that causes the problem, but rather ambiguity in the context. Using this assumption, whether it is true or false, we can then construct a simple algorithm for grammatical inference, based purely on the idea that whenever we find a pair of strings that are weakly substitutable, we can generalise the hypothesized language so that they are syntactically congruent.</Paragraph>
      <Paragraph position="3"> The algorithm proceeds by constructing a graph where every substring in the sample defines a node.</Paragraph>
      <Paragraph position="4"> An arc is drawn between two nodes if and only if the two nodes are weakly substitutable with respect to the sample, i.e. there is an arc between u and v if and only if we have observed in the sample strings of the form lur and lvr. Clearly all of the strings in the sample will form a clique in this graph (consider when l and r are both empty strings). The connected components of this graph can be computed in time polynomial in the total size of the sample. If the language is substitutable then each of these components will correspond to a congruence class of the language.</Paragraph>
      <Paragraph position="5"> There are two ways of doing this: one way, which is perhaps the purest involves defining a reduction system or semi-Thue system which directly captures this generalisation process. The second way, which we present here, will be more familiar to computational linguists, and involves constructing a grammar. null</Paragraph>
    </Section>
    <Section position="4" start_page="128" end_page="128" type="sub_section">
      <SectionTitle>
3.4 Grammar construction
</SectionTitle>
      <Paragraph position="0"> Simply knowing the syntactic congruence might not appear to be enough to learn a context-free grammar, but in fact it is. In fact given the syntactic congruence, and a sample of the language, we can simply write down a grammar in Chomsky normal form, and under quite weak assumptions this grammar will converge to a correct grammar for the language.</Paragraph>
      <Paragraph position="1"> This construction relies on a simple property of the syntactic congruence, namely that is in fact a congruence: i.e., u [?]L v implies [?]l, r lur [?]L lvr We define the syntactic monoid to be the quotient of the monoid S[?]/ [?]L. The monoid operation [u][v] = [uv] is well defined since if u [?]L uprime and v [?]L vprime then uv [?]L uprimevprime.</Paragraph>
      <Paragraph position="2"> We can construct a grammar in the following trivial way, from a sample of strings where we are given the syntactic congruence.</Paragraph>
      <Paragraph position="3">  contains all the strings of the language.</Paragraph>
      <Paragraph position="4"> This defines a grammar in CNF. At first sight, this construction might appear to be completely vacuous, and not to define any strings beyond those in the sample. The situation where it generalises is when two different strings are congruent: if uv = w [?] wprime = uprimevprime then we will have two different rules [w] - [u][v] and [w] - [uprime][vprime], since [w] is the same non-terminal as [wprime].</Paragraph>
      <Paragraph position="5"> A striking feature of this algorithm is that it makes no attempt to identify which of these congruence classes correspond to non-terminals in the target grammar. Indeed that is to some extent an ill-posed question. There are many different ways of assigning constituent structure to sentences, and indeed some reputable theories of syntax, such as dependency grammars, dispense with the notion of constituent structure all together. De facto standards, such as the Penn treebank annotations are a somewhat arbitrary compromise among many different possible analyses. This algorithm instead relies on the syntactic monoid, which expresses the combinatorial structure of the language in its purest form.</Paragraph>
    </Section>
    <Section position="5" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
3.5 Proof
</SectionTitle>
      <Paragraph position="0"> We will now present our main result, with an outline proof. For a full proof the reader is referred to (Clark and Eyraud, 2005).</Paragraph>
      <Paragraph position="1"> Theorem 1 This algorithm polynomially identifies in the limit the class of substitutable context-free languages.</Paragraph>
      <Paragraph position="2"> Proof (Sketch) We can assume without loss of generality that the target grammar is in Chomsky normal form. We first define a characteristic set, that is to say a set of strings such that whenever the sample includes the characteristic set, the algorithm will output a correct grammar.</Paragraph>
      <Paragraph position="3"> We define w(a) [?] S[?] to be the smallest word, according to [?], generated by a [?] (S [?] V )+. For each non-terminal N [?] V define c(N) to be the smallest pair of terminal strings (l, r) (extending [?] from S[?] to S[?] x S[?], in some way), such that S [?]= lNr.</Paragraph>
      <Paragraph position="4"> We can now define the characteristic set CS =</Paragraph>
      <Paragraph position="6"> The cardinality of this set is at most |P |which is clearly polynomially bounded. We observe that the computations involved can all be polynomially bounded in the total size of the sample.</Paragraph>
      <Paragraph position="7"> We next show that whenever the algorithm encounters a sample that includes this characteristic set, it outputs the right grammar. We write ^G for the learned grammar. Suppose [u] [?]=^G v. Then we can see that u [?]L v by induction on the maximum length of the derivation of v. At each step we must use some rule [uprime] = [vprime][wprime]. It is easy to see that every rule of this type preserves the syntactic congruence of the left and right sides of the rules. Intuitively, the algorithm will never generate too large a language, since the languages are substitutable. Conversely, if we have a derivation of a string u with respect to the target grammar G, by  construction of the characteristic set, we will have, for every production L - MN in the target grammar, a production in the hypothesized grammar of the form [w(L)] - [w(M)][w(N)], and for every production of the form L - a we have a production [w(L)] - a. A simple recursive argument shows that the hypothesized grammar will generate all the strings in the target language. Thus the grammar will generate all and only the strings required (QED).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="129" end_page="129" type="metho">
    <SectionTitle>
3.6 Related work
</SectionTitle>
    <Paragraph position="0"> This is the first provably correct and efficient grammatical inference algorithm for a linguistically interesting class of context-free grammars (but see for example (Yokomori, 2003) on the class of very simple grammars). It can also be compared to Angluin's famous work on reversible grammars (Angluin, 1982) which inspired a similar paper(Pilato and Berwick, 1985).</Paragraph>
  </Section>
class="xml-element"></Paper>