<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1304"> <Title>Grammatical Inference and First Language Acquisition</Title> <Section position="3" start_page="26" end_page="26" type="metho"> <SectionTitle> 2 First Language Acquisition </SectionTitle> <Paragraph position="0"> Let us first examine the phenomenon we are concerned with: first language acquisition. In the space of a few years, children almost invariably acquire, in the absence of explicit instruction, one or more of the languages that they are exposed to. A multitude of subsidiary debates have sprung up around this central issue, covering questions about critical periods - the ages at which this can take place - the exact nature of the evidence available to the child, and the various phases of linguistic use through which the infant child passes. In the opinion of many researchers, explaining this ability is one of the most important challenges facing linguists and cognitive scientists today.</Paragraph> <Paragraph position="1"> A difficulty for us in this paper is that many of the idealisations made in the study of this field are in fact demonstrably false. Classical assumptions, such as the existence of uniform communities of language users, are well-motivated in the study of the &quot;steady state&quot; of a system, but less so when studying acquisition and change. There is a regrettable tendency to slip from viewing these idealisations correctly - as counter-factual idealisations - to viewing them as empirical facts that need to be explained. Thus, when looking for an appropriate formulation of the problem, we should recall, for example, the fact that different children do not converge to exactly the same knowledge of language, as is sometimes claimed, nor do all of them acquire a language competently at all, since there is a small proportion of children who, though apparently neurologically normal, fail to acquire language. In the context of our discussion later on, these observations lead us to accept slightly less stringent criteria, where we allow a small probability of failure and do not demand perfect equality of hypothesis and target.</Paragraph> </Section> <Section position="4" start_page="26" end_page="27" type="metho"> <SectionTitle> 3 Grammatical Inference </SectionTitle> <Paragraph position="0"> The general field of machine learning has a specialised subfield that deals with the learning of formal languages. This field, Grammatical Inference (GI), is characterised above all by an interest in formal results, both in terms of formal characterisations of the target languages, and in terms of formal proofs either that particular algorithms can learn according to particular definitions, or that sets of languages cannot be learnt. In spite of its theoretical bent, GI algorithms have also been applied with some success. Natural language, however, is not the only source of real-world applications for GI. Other domains include biological sequence data, artificial languages (such as discovering XML schemas), and sequences of moves of a robot. The field is also driven by technical motives and the intrinsic elegance and interest of the mathematical ideas employed. In summary, it is not just about language, and accordingly it has developed a rich vocabulary to deal with the wide range of its subject matter.</Paragraph> <Paragraph position="1"> In particular, researchers are often concerned with formal results - that is, we want algorithms for which we can prove that they will perform in a certain way.
Often, we may be able to empirically establish that a particular algorithm performs well, in the sense of reliably producing an accurate model, while we may be unable to prove formally that the algorithm will always perform in this way. This can be for a number of reasons: the mathematics required in the derivation of the bounds on the errors may be difficult or obscure, or the algorithm may behave strangely when dealing with sets of data which are ill-behaved in some way.</Paragraph> <Paragraph position="2"> The basic framework can be considered as a game played between two players. One player, the teacher, provides information to another, the learner, and from that information the learner must identify the underlying language. We can break down this situation further into a number of elements. We assume that the languages to be learned are drawn in some way from a possibly infinite class of languages, L, which is a set of formal mathematical objects. The teacher selects one of these languages, which we call the target, and then gives the learner a certain amount of information of various types about the target. After a while, the learner then returns its guess, the hypothesis, which in general will be a language drawn from the same class L. Ideally, the learner has been able to deduce, induce or abduce something about the target from the information we have given it, and in this case the hypothesis it returns will be identical to, or close in some technical sense to, the target. If the learner can consistently do this, under whatever constraints we choose, then we say it can learn that class of languages. To turn this vague description into something more concrete requires us to specify a number of things.</Paragraph> <Paragraph position="3"> What sort of mathematical object should we use to represent a language? What is the target class of languages? What information is the learner given? What computational constraints does the learner operate under? How close must the target be to the hypothesis, and how do we measure it? This paper addresses the extent to which negative results in GI could be relevant to this real-world situation. As always, when negative results from theory are being applied, a certain amount of caution is appropriate in examining the underlying assumptions of the theory and the extent to which these are applicable. As we shall see, in our opinion, none of the current negative results, though powerful, is applicable to the empirical situation. We shall accordingly, at various points, make strong pessimistic assumptions about the learning environment of the child, and show that even under these unrealistically stringent stipulations, the negative results are still inapplicable. This will make the conclusions we come to a little sharper. Conversely, if we wanted to show that the negative results did apply, to be convincing we would have to make rather optimistic assumptions about the learning environment.</Paragraph> </Section> <Section position="5" start_page="27" end_page="30" type="metho"> <SectionTitle> 4 Applying GI to FLA </SectionTitle> <Paragraph position="0"> We now have the delicate task of selecting, or rather constructing, a formal model by specifying the various components we have identified above. We want to choose the model that is the best representation of the learning task or tasks that the infant child must perform. We consider that some of the empirical questions do not yet have clear answers.
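Before settling those questions, it may help to fix the shape of the learning game itself. The following sketch is purely illustrative: the function and variable names and the two-language toy class are our own inventions, not drawn from the paper, but the roles of teacher, learner, target and hypothesis are exactly those described above.

    # Minimal sketch of the learning game; all names are hypothetical and the
    # two-language class is a toy example.
    import random

    def play_round(language_class, choose_target, present, learn, n_samples=100):
        """The teacher picks a target from the class and presents information
        about it; the learner returns a hypothesis drawn from the same class."""
        target = choose_target(language_class)          # the teacher's choice
        data = present(target, n_samples)               # e.g. positive samples from the target
        hypothesis = learn(data)                        # the learner's guess
        return target, hypothesis

    # A toy finite class of two crisp languages over the alphabet {a, b}:
    L1 = {"a", "ab", "abb"}
    L2 = {"b", "ba", "baa"}
    target, hypothesis = play_round(
        [L1, L2],
        choose_target=random.choice,
        present=lambda L, n: [random.choice(sorted(L)) for _ in range(n)],
        learn=lambda data: L1 if set(data) <= L1 else L2,   # a trivial learner suffices here
    )
    print(hypothesis == target)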
In those cases, we shall make the choice that makes the learning task more difficult. In other cases, we may not have a clear idea of how to formalise some information source. We shall start by making a significant idealisation: we consider language acquisition as being a single task. Natural languages as traditionally describe have different levels. At the very least we have morphology and syntax; one might also consider inter-sentential or discourse as an additional level. We conflate all of these into a single task: learning a formal language; in the discussion below, for the sake of concreteness and clarity, we shall talk in terms of learning syntax.</Paragraph> <Section position="1" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 4.1 The Language </SectionTitle> <Paragraph position="0"> The first question we must answer concerns the language itself. A formal language is normally defined as follows. Given a finite alphabet , we define the set of all strings (the free monoid) over as .</Paragraph> <Paragraph position="1"> We want to learn a language L . The alphabet could be a set of phonemes, or characters, or a set of words, or a set of lexical categories (part of speech tags). The language could be the set of well-formed sentences, or the set of words that obey the phonotactics of the language, and so on. We reduce all of the different learning tasks in language to a single abstract task - identifying a possibly infinite set of strings. This is overly simplistic since transductions, i.e. mappings from one string to another, are probably also necessary. We are using here a standard definition of a language where every string is unambiguously either in or not in the language.. This may appear unrealistic - if the formal language is meant to represent the set of grammatical sentences, there are well-known methodological problems with deciding where exactly to draw the line between grammatical and ungrammatical sentences. An alternative might be to consider acceptability rather than grammaticality as the defining criterion for inclusion in the set. Moreover, there is a certain amount of noise in the input - There are other possibilities. We could for example use a fuzzy set - i.e. a function from ! [0; 1] where each string has a degree of membership between 0 and 1. This would seem to create more problems than it solves. A more appealing option is to learn distributions, again functions f from ! [0; 1] but where Ps2L f(s) = 1. This is of course the classic problem of language modelling, and is compelling for two reasons. First, it is empirically well grounded - the probability of a string is related to its frequency of occurrence, and secondly, we can de- null duce from the speech recognition capability of humans that they must have some similar capability.</Paragraph> <Paragraph position="2"> Both possibilities - crisp languages, and distributions - are reasonable. The choice depends on what one considers the key phenomena to be explained are - grammaticality judgments by native speakers, or natural use and comprehension of the language. We favour the latter, and accordingly think that learning distributions is a more accurate and more difficult choice.</Paragraph> </Section> <Section position="2" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 4.2 The class of languages </SectionTitle> <Paragraph position="0"> A common confusion in some discussions of this topic is between languages and classes of languages. 
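To make both distinctions concrete - crisp language versus distribution, and a single language versus a class of languages - here is a small illustrative sketch. The toy language a^n b^n and the particular probabilities are invented for illustration, not taken from the paper.

    # Illustrative only: toy representations of one language (crisp and as a
    # distribution) and of a class of languages.
    from itertools import product

    SIGMA = ("a", "b")                        # a finite alphabet

    def sigma_star(max_len):
        """A finite slice of the free monoid: all strings over SIGMA up to max_len."""
        for n in range(max_len + 1):
            for tup in product(SIGMA, repeat=n):
                yield "".join(tup)

    # A crisp language: an indicator function on strings (here the toy language a^n b^n).
    def in_L(s):
        n = len(s) // 2
        return s == "a" * n + "b" * n

    # The same language as a distribution: f(s) >= 0, with the f(s) summing to 1 over L.
    def f(s):
        return 0.5 ** (len(s) // 2 + 1) if in_L(s) else 0.0

    # A class of languages: a set of languages (here a toy finite family of crisp ones).
    language_class = [lambda s, k=k: set(s) <= set(SIGMA[:k]) for k in (1, 2)]

    print(sum(f(s) for s in sigma_star(12)))  # approaches 1 as max_len grows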
Learnability is a property of classes of languages. If there is only one language in the class of languages to be learned, then the learner can just guess that language and succeed. A class with two languages is again trivially learnable if you have an efficient algorithm for testing membership. It is only when the set of languages is exponentially large or infinite that the problem becomes non-trivial from a theoretical point of view. The class of languages we need is one that includes all attested human languages and additionally all &quot;possible&quot; human languages. Natural languages are thought to fall into the class of mildly context-sensitive languages (Vijay-Shanker and Weir, 1994), so this class is clearly large enough. It is, however, not necessary that our class be this large. Indeed, it is essential for learnability that it is not. As we shall see below, even the class of regular languages contains some subclasses that are computationally hard to learn. Indeed, we claim it is reasonable to define our class so that it does not contain languages that are clearly not possible human languages.</Paragraph> </Section> <Section position="3" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 4.3 Information sources </SectionTitle> <Paragraph position="0"> Next we must specify the information that our learning algorithm has access to. Clearly the primary source of data is the primary linguistic data (PLD), namely the utterances that occur in the child's environment. These will consist of both child-directed speech and adult-to-adult speech. These are generally acceptable sentences, that is to say, sentences that are in the language to be learned. These are called positive samples. One of the most long-running debates in this field is over whether the child has access to negative data - unacceptable sentences that are marked in some way as such. The consensus (Marcus, 1993) appears to be that they do not. In middle-class Western families, children are provided with some sort of feedback about the well-formedness of their utterances, but this is unreliable and erratic, and is not a universal feature of child-raising across cultures.</Paragraph> <Paragraph position="1"> Furthermore, this appears to have no effect on the child. Children also get indirect pragmatic feedback if their utterances are incomprehensible. In our opinion, both of these would be better modelled by what is called a membership query: the algorithm may generate a string and be informed whether that string is in the language or not. However, we feel that this is too erratic to be considered an essential part of the process. Another question is whether the input data is presented as a flat string or annotated with some sort of structural evidence, which might be derived from prosodic or semantic information.</Paragraph> <Paragraph position="2"> Unfortunately, there is little agreement on what the constituent structure should be - indeed many linguistic theories do not have a level of constituent structure at all, but just dependency structure.</Paragraph> <Paragraph position="3"> Semantic information is also claimed as an important source. The hypothesis is that children can use lexical semantics, coupled with rich sources of real-world knowledge, to infer the meaning of utterances from the situational context. That would be an extremely powerful piece of information, but it is clearly absurd to claim that the meaning of an utterance is uniquely specified by the situational context.
If that were true, there would be no need for communication or information transfer at all. Of course the context puts some constraints on the sentences that will be uttered, but it is not clear how to incorporate this fact without being far too generous. In summary, it appears that only positive evidence can be unequivocally relied upon, though this may seem a harsh and unrealistic environment.</Paragraph> </Section> <Section position="4" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 4.4 Presentation </SectionTitle> <Paragraph position="0"> We have now decided that the only evidence available to the learner will be unadorned positive samples drawn from the target language. There are various possibilities for how the samples are selected.</Paragraph> <Paragraph position="1"> The choice that is most favourable for the learner is where they are selected by a helpful teacher to make the learning process as easy as possible (Goldman and Mathias, 1996). While it is certainly true that carers speak to small children in sentences of simple structure (Motherese), this is not true for all of the data that the child has access to, nor is it universally valid. Moreover, there are serious technical problems with formalising this, namely what is called 'collusion', where the teacher provides examples that encode the grammar itself, thus trivialising the learning process. Though attempts have been made to limit this problem, they are not yet completely satisfactory. The next alternative is that the examples are selected randomly from some fixed distribution. This appears to us to be the appropriate choice, subject to some limitations on the distributions that we discuss below. The final option, the most difficult for the learner, is where the sequence of samples can be selected by an intelligent adversary, in an attempt to make the learner fail, subject only to the weak requirement that each string in the language appears at least once. This is the approach taken in the identification in the limit paradigm (Gold, 1967), and is clearly too stringent.</Paragraph> <Paragraph position="2"> The remaining question then regards the distribution from which the samples are drawn: whether the learner has to be able to learn for every possible distribution, or only for distributions from a particular class, or only for one particular distribution.</Paragraph> </Section> <Section position="5" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 4.5 Resources </SectionTitle> <Paragraph position="0"> Beyond the requirement of computability, we will wish to place additional limitations on the computational resources that the learner can use. Since children learn the language in a limited period of time, which limits both the amount of data they have access to and the amount of computation they can use, it seems appropriate to disallow algorithms that use unbounded or very large amounts of data or time.</Paragraph> <Paragraph position="1"> As usual, we shall formalise this by putting polynomial bounds on the sample complexity and the computational complexity. Since the individual samples are of varying length, we need to allow the computational complexity to depend on the total length of the sample. A key question is what the parameters of the sample complexity polynomial should be.
We shall discuss this further below.</Paragraph> </Section> <Section position="6" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 4.6 Convergence Criteria </SectionTitle> <Paragraph position="0"> Next we address the issue of reliability: the extent to which all children acquire language. First, variability in achievement of particular linguistic milestones is high. There are numerous causes, including deafness, mental retardation, cerebral palsy, specific language impairment and autism. Generally, autistic children appear neurologically and physically normal, but about half may never speak. Autism, on some accounts, has an incidence of about 0.2%.</Paragraph> <Paragraph position="1"> Therefore we can require learning to happen with arbitrarily high probability, but requiring it to happen with probability one is unreasonable. A related question concerns convergence: the extent to which children exposed to a linguistic environment end up with the same language as others. Clearly they are very close, since otherwise communication could not happen, but there is ample evidence from studies of variation (Labov, 1975) that there are non-trivial differences between adults who have grown up with near-identical linguistic experiences, both about the interpretation and about the syntactic acceptability of simple sentences, quite apart from the wide purely lexical variation that is easily detected. A famous example in English is &quot;Each of the boys didn't come&quot;. Moreover, language change requires some children to end up with slightly different grammars from the older generation. At the very most, we should require that the hypothesis be close to the target. The function we use to measure the 'distance' between hypothesis and target depends on whether we are learning crisp languages or distributions. If we are learning distributions, then the obvious choice is the Kullback-Leibler divergence - a very strict measure. For crisp languages, the probability of the symmetric difference with respect to some distribution is natural.</Paragraph> </Section> <Section position="7" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 4.7 PAC-learning </SectionTitle> <Paragraph position="0"> These considerations lead us to some variant of the Probably Approximately Correct (PAC) model of learning (Valiant, 1984). We require the algorithm to produce, with arbitrarily high probability, a good hypothesis. We formalise this by saying that for any confidence δ > 0 it must produce a good hypothesis with probability more than 1 - δ. Next we require a good hypothesis to be arbitrarily close to the target, so we have a precision ε, and we say that for any ε > 0, the hypothesis must be less than ε away from the target.</Paragraph> <Paragraph position="1"> We allow the amount of data it can use to increase as the confidence and precision get smaller. We define PAC-learning in the following way: given a finite alphabet Σ and a class of languages L over Σ, an algorithm PAC-learns the class L if there is a polynomial q such that for every confidence δ > 0 and precision ε > 0, for every distribution D over Σ*, and for every language L in the class, whenever the number of samples exceeds q(1/δ, 1/ε, |Σ|, |L|), the algorithm must produce a hypothesis H such that, with probability greater than 1 - δ, Pr_D(H △ L) < ε. Here we use A △ B to mean the symmetric difference between two sets. The polynomial q is called the sample complexity polynomial. We also limit the amount of computation to some polynomial in the total length of the data it has seen.
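The definition can be read operationally. The sketch below is a purely illustrative harness - the toy target, toy hypothesis and sampling distribution are all invented here - that estimates the quantity Pr_D(H △ L) by sampling; a PAC-learner must guarantee that this error falls below ε with probability at least 1 - δ once it has seen q(1/δ, 1/ε, |Σ|, |L|) samples.

    # Illustrative only: Monte Carlo estimate of the PAC error Pr_D(H symmetric-difference L).
    import random

    def pac_error(draw_from_D, in_target, in_hypothesis, n=10000):
        """Estimate the probability, under D, of strings on which hypothesis and target disagree."""
        disagreements = 0
        for _ in range(n):
            s = draw_from_D()
            if in_target(s) != in_hypothesis(s):
                disagreements += 1
        return disagreements / n

    # Toy usage over the alphabet {a, b}: the target is the set of strings with an even
    # number of a's, the hypothesis is the set of strings of even length, and D draws a
    # length uniformly up to 6 and then symbols uniformly.
    def draw():
        return "".join(random.choice("ab") for _ in range(random.randint(0, 6)))

    err = pac_error(draw,
                    in_target=lambda s: s.count("a") % 2 == 0,
                    in_hypothesis=lambda s: len(s) % 2 == 0)
    print(err)   # the estimated error of this (deliberately poor) hypothesis under D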
Note first of all that this is a worst-case bound - we are not requiring merely that on average it comes close. Additionally, this model is what is called 'distribution-free'. This means that the algorithm must work for every combination of distribution and language. This is a very stringent requirement, only mitigated by the fact that the error is calculated with respect to the same distribution that the samples are drawn from. Thus, if there is a subset of Σ* with low aggregate probability under D, the algorithm will not get many samples from this region, but will not be penalised very much for errors in that region. From our point of view, there are two problems with this framework: first, we only want to draw positive samples, but the distributions are over all strings in Σ*, and include some that give a zero probability to all strings in the language concerned. Secondly, this is too pessimistic because the distribution has no relation to the language: intuitively it is reasonable to expect the distribution to be derived in some way from the language, or from the structure of a grammar generating the language. Indeed, there is a causal connection in reality, since the sample of the language the child is exposed to is generated by people who do in fact know the language.</Paragraph> <Paragraph position="2"> One alternative that has been suggested is the PAC learning with simple distributions model introduced by Denis (2001). This is based on ideas from complexity theory, where the samples are drawn according to a universal distribution defined by the conditional Kolmogorov complexity. While mathematically correct, this is inappropriate as a model of FLA for a number of reasons. First, learnability is proven only for a single, very unusual distribution and relies on particular properties of that distribution; secondly, there are some very large constants in the sample complexity polynomial.</Paragraph> <Paragraph position="3"> The solution we favour is to define some natural class of distributions based on a grammar or automaton generating the language. Given a class of languages defined by some generative device, there is normally a natural stochastic variant of the device which defines a distribution over that language. Thus regular languages can be defined by finite-state automata, and these can be naturally extended to probabilistic finite-state automata. Similarly, context-free languages are normally defined by context-free grammars, which can again be extended to probabilistic or stochastic CFGs. We therefore propose a slight modification of the PAC framework. For every class of languages L defined by some formal device, we define a class of distributions D, defined by a stochastic variant of that device. Then for each language L, we select the set of distributions whose support is equal to the language, subject to a polynomial bound q on the complexity of the distribution in terms of the complexity of the target language: D+_L = {D ∈ D : L = supp(D) ∧ |D| < q(|L|)}. Samples are drawn from one of these distributions.</Paragraph> <Paragraph position="4"> There are two technical problems here: first, this does not penalise over-generalisation. Since the distribution is over positive examples, negative examples have zero weight, so we need some penalty function over negative examples, or alternatively we must require the hypothesis to be a subset of the target. Secondly, this definition is too vague.
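To illustrate what such a stochastic variant looks like, here is a minimal sketch - a toy machine of our own, not one from the paper - of a probabilistic deterministic finite-state automaton whose support is the regular language (ab)*: each state carries a stopping probability and a distribution over outgoing symbols, so the device defines a distribution over exactly the strings of the crisp language.

    # Illustrative only: a toy PDFA whose support is the regular language (ab)*.
    import random

    # transitions[state][symbol] = (next_state, emission probability);
    # stop[state] = probability of halting in that state.
    # In each state, stop[state] plus the emission probabilities sum to 1.
    transitions = {0: {"a": (1, 0.6)}, 1: {"b": (0, 1.0)}}
    stop = {0: 0.4, 1: 0.0}

    def step(state):
        """Sample the next symbol from the state's distribution, or None to halt."""
        r, acc = random.random(), stop[state]
        if r < acc:
            return None
        for sym, (_, p) in transitions[state].items():
            acc += p
            if r < acc:
                return sym
        return None

    def sample(max_len=50):
        """Generate one string from the distribution defined by the automaton."""
        state, out = 0, []
        for _ in range(max_len):
            sym = step(state)
            if sym is None:
                break
            out.append(sym)
            state = transitions[state][sym][0]
        return "".join(out)

    def probability(s):
        """Probability assigned to a whole string; zero outside the support (ab)*."""
        state, p = 0, 1.0
        for sym in s:
            if sym not in transitions[state]:
                return 0.0
            state, q = transitions[state][sym]
            p *= q
        return p * stop[state]

    print(sample(), probability("abab"))   # probability("abab") = 0.6 * 1.0 * 0.6 * 1.0 * 0.4 = 0.144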
The exact way in which the &quot;crisp&quot; language is extended to a stochastic one can have serious consequences. When dealing with regular languages, for example, though the class of languages defined by deterministic automata is the same as that defined by non-deterministic automata, the same is not true for their stochastic variants. Additionally, one can have exponential blow-ups in the number of states when determinising automata. Similarly, for CFGs, Abney et al. (1999) showed that two parametrisations of stochastic context-free languages are equivalent, but that converting between them leads to blow-ups in both directions. We do not have a completely satisfactory solution to this problem at the moment; an alternative is to consider learning the distributions rather than the languages.</Paragraph> <Paragraph position="5"> In the case of learning distributions, we have the same framework, but the samples are drawn according to the distribution being learned, T, and we require that the hypothesis H has small divergence from the target: D(T||H) < ε. Since the divergence is infinite if the hypothesis gives probability zero to a string in the target, this will have the consequence that the hypothesis must assign a non-zero probability to every string.</Paragraph> </Section> </Section> <Section position="6" start_page="30" end_page="31" type="metho"> <SectionTitle> 5 Negative Results </SectionTitle> <Paragraph position="0"> Now that we have a fairly clear idea of various ways of formalising the situation, we can consider the extent to which formal results apply. We start by considering negative results, which in machine learning come in two types. First, there are information-theoretic bounds on sample complexity, derived from the Vapnik-Chervonenkis (VC) dimension of the space of languages, a measure of the complexity of the set of hypotheses. If we add a parameter to the sample complexity polynomial that represents the complexity of the concept to be learned, then this will remove these problems. This can be the size of a representation of the target, which will be polynomial in the number of states, or simply the number of non-terminals or states. This is very standard in most fields of machine learning.</Paragraph> <Paragraph position="1"> The second problem relates not to the amount of information but to the computation involved.</Paragraph> <Paragraph position="2"> Results derived from cryptographic limitations on computational complexity can be proved based on widely held and well-supported assumptions that certain hard cryptographic problems are insoluble.</Paragraph> <Paragraph position="3"> In what follows we assume that there are no efficient algorithms for common cryptographic problems such as factoring Blum integers, inverting the RSA function, recognising quadratic residues or learning noisy parity functions.</Paragraph> <Paragraph position="4"> There may be algorithms that will learn with reasonable amounts of data but that require unfeasibly large amounts of computation to find. There are a number of powerful negative results on learning in the purely distribution-free situation we considered and rejected above. Kearns and Valiant (1989) showed that acyclic deterministic automata are not learnable even with positive and negative examples.
Similarly, Abe and Warmuth (1992) showed a slightly weaker, representation-dependent result for non-deterministic automata learned over a large alphabet, by showing that there are strings such that maximising the likelihood of the string is NP-hard.</Paragraph> <Paragraph position="5"> Again, this does not strictly apply to the partially distribution-free situation we have chosen.</Paragraph> <Paragraph position="6"> However, there is one very strong result that appears to apply. A straightforward consequence of (Kearns et al., 1994) shows that acyclic deterministic probabilistic FSAs over a two-letter alphabet cannot be learned under another cryptographic assumption (the noisy parity assumption). Therefore any class of languages that includes this comparatively weak family will not be learnable in our framework.</Paragraph> <Paragraph position="7"> But this rests upon the assumption that the class of possible human languages must include some cryptographically hard functions. It appears that our formal apparatus does not distinguish between these cryptographic functions, which have been consciously designed to be hard to learn, and natural languages, which presumably have evolved to be easy to learn, since there is no evolutionary pressure to make them hard to decrypt - no intelligent predators eavesdropping, for example. Clearly this is a flaw in our analysis: we need to find some more nuanced description for the class of possible human languages that excludes these hard languages or distributions.</Paragraph> </Section> <Section position="7" start_page="31" end_page="31" type="metho"> <SectionTitle> 6 Positive results </SectionTitle> <Paragraph position="0"> There is a positive result that shows a way forward.</Paragraph> <Paragraph position="1"> A PDFA is μ-distinguishable if the distributions generated from any two states differ by at least μ in the L∞-norm, i.e. there is a string with a difference in probability of at least μ. Ron et al. (1995) showed that μ-distinguishable acyclic PDFAs can be PAC-learned, using the KL divergence as error function, in time polynomial in n, 1/ε, 1/δ, 1/μ and |Σ|. They use a variant of a standard state-merging algorithm.</Paragraph> <Paragraph position="2"> Since these are acyclic, the languages they define are always finite. This additional criterion of distinguishability suffices to guarantee learnability. This work can be extended to cyclic automata (Clark and Thollard, 2004a; Clark and Thollard, 2004b), and thus to the class of all regular languages, with the addition of a further parameter which bounds the expected length of a string generated from any state.</Paragraph> <Paragraph position="3"> The use of distinguishability seems innocuous; in syntactic terms it is a consequence of the plausible condition that for any pair of distinct non-terminals there is some fairly likely string generated by one and not the other. Similarly, strings of symbols in natural language tend to have limited length. An alternative way of formalising this is to define a class of distinguishable automata, where the distinguishability of the automata is lower-bounded by an inverse polynomial in the number of states. This is formally equivalent, but avoids adding terms to the sample complexity polynomial.
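To give a feel for the distinguishability condition, here is a small illustrative check, reusing the toy (ab)* automaton from the earlier sketch (the function names are ours): it computes the largest single-string probability gap between the distributions generated from two states, the quantity that the condition requires to be at least μ.

    # Illustrative only: the largest single-string probability gap between two states
    # of a toy PDFA, enumerated over strings up to a fixed length.
    from itertools import product

    transitions = {0: {"a": (1, 0.6)}, 1: {"b": (0, 1.0)}}   # the (ab)* machine again
    stop = {0: 0.4, 1: 0.0}

    def string_prob(start, s):
        """Probability of generating string s when started in the given state."""
        state, p = start, 1.0
        for sym in s:
            if sym not in transitions[state]:
                return 0.0
            state, q = transitions[state][sym]
            p *= q
        return p * stop[state]

    def distinguishability(s1, s2, alphabet=("a", "b"), max_len=8):
        """Max over strings of |P_s1(x) - P_s2(x)|, the gap the condition bounds below by mu."""
        gap = 0.0
        for n in range(max_len + 1):
            for tup in product(alphabet, repeat=n):
                x = "".join(tup)
                gap = max(gap, abs(string_prob(s1, x) - string_prob(s2, x)))
        return gap

    print(distinguishability(0, 1))   # 0.4 here: the empty string already separates the two states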
In summary, this would be a valid solution if all human languages actually lay within the class of regular languages.</Paragraph> <Paragraph position="4"> Note also the general properties of this kind of algorithm: provably learning an infinite class of languages with infinite support, using only polynomial amounts of data and computation.</Paragraph> <Paragraph position="5"> It is worth pointing out that the algorithm does not need to &quot;know&quot; the values of the parameters. Define a new parameter t, and set, for example, n = t, L = t, δ = e^(-t), μ = 1/t and ε = 1/t. This gives a sample complexity polynomial in one parameter, q(t). Given a certain amount of data N, we can just choose the largest value of t such that q(t) < N, and set the parameters accordingly.</Paragraph> </Section> <Section position="8" start_page="31" end_page="32" type="metho"> <SectionTitle> 7 Parametric models </SectionTitle> <Paragraph position="0"> We can now examine the relevance of these results to the distinction between parametric and non-parametric models of language. Parametric models are those where the class of languages is parametrised by a small set of finite-valued (binary) parameters, where the number of parameters is small compared to the log2 of the complexity of the languages. Without this latter constraint the notion is mathematically vacuous, since, for example, any context-free grammar in Chomsky normal form can be parametrised with N^3 + NM + 1 binary parameters, where N is the number of non-terminals and M the number of terminals. This constraint is also necessary for parametric models to make testable empirical predictions about language universals, developmental evidence and the relationships between the two (Hyams, 1986). We neglect here the important issue of lexical learning: we assume, implausibly, that lexical learning can take place completely before syntax learning commences. It has in the past been stated that the finiteness of a language class suffices to guarantee learnability even under a PAC-learning criterion (Bertolo, 2001). This is, in general, false, and arises from neglecting constraints on the sample complexity and the computational complexities both of learning and of parsing. The negative result of (Kearns et al., 1994) discussed above also applies to parametric models. The specific class of noisy parity functions that they prove unlearnable is parametrised by a number of binary parameters, in a way very reminiscent of a parametric model of language. The mere fact that there is a finite number of parameters does not suffice to guarantee learnability if the resulting class of languages is exponentially large, or if there is no polynomial algorithm for parsing. This does not imply that all parametrised classes of languages will be unlearnable, only that having a small number of parameters is neither necessary nor sufficient to guarantee efficient learnability. If the parameters are shallow, relate to easily detectable properties of the languages, and are independent, then learning can occur efficiently (Yang, 2002). If they are &quot;deep&quot; and inter-related, learning may be impossible.
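The constraint can be made concrete with a small back-of-the-envelope computation. The sketch below uses invented values of N and M purely for illustration: counting one bit per possible CNF rule gives a "parameter" count of the same order as the description length of the grammar itself, which is exactly the sense in which the notion becomes vacuous without the constraint.

    # Illustrative only: counting the binary "parameters" of a CNF grammar with
    # N non-terminals and M terminals, i.e. one bit per possible rule A -> B C,
    # one per possible rule A -> a, plus one further bit (e.g. for the empty string).
    def cnf_binary_parameters(N, M):
        return N**3 + N * M + 1

    for N, M in [(2, 10), (20, 10_000)]:
        # the resulting class contains up to 2**k grammars, so k is of the order of the
        # grammar's description length rather than a small set of switches
        k = cnf_binary_parameters(N, M)
        print(N, M, k)          # 29 and 208001 binary "parameters" respectively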
Learnability depends more on simple statistical properties of the distributions of the samples than on the structure of the class of languages.</Paragraph> <Paragraph position="1"> Our conclusion, then, is ultimately that the theory of learnability will not be able to resolve disputes about the nature of first language acquisition: these problems will have to be answered by empirical research, rather than by mathematical analysis.</Paragraph> </Section> </Paper>