<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1013"> <Title>Partially Distribution-Free Learning of Regular Languages from Positive Samples</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Appropriateness </SectionTitle> <Paragraph position="0"> Regular languages are widely used in applications drawn from numerous domains, such as computational biology and robotics. In many of these areas, efficient learning algorithms are desirable, but in each the exact requirements will be different, since the sources of information and the desired properties of the algorithms vary widely. We argue here that learning algorithms in NLP have certain special properties that make the particular learnability result we study here useful. The most important feature, in our opinion, is the necessity of learning from positive examples only.</Paragraph> <Paragraph position="1"> Negative examples in NLP are rarely available.</Paragraph> <Paragraph position="2"> Even in a binary classification problem, there will often be some overlap between the classes, so that examples labelled with − are not necessarily negative examples of the class labelled with +. For this reason alone we consider a traditional distribution-free PAC-learning framework to be wholly inappropriate: an essential part of that framework is a sort of symmetry between the positive and negative examples. Furthermore, there are a number of negative results which rule out distribution-free learning of regular languages (Kearns et al., 1994).</Paragraph> <Paragraph position="3"> A related problem is that in the learning situations that occur in practice in NLP, and also in those, such as first language acquisition, that one wishes to model formally, the distribution of examples depends on the concept being learned. Thus, if we are modelling the acquisition of the grammar of a language, the positive examples are the grammatical, or perhaps acceptable, sentences of the target language. The distribution of examples is clearly highly dependent on the particular language, simply as a matter of fact, in that the sentences in the sample are generated by people who have acquired the language.</Paragraph> <Paragraph position="4"> It thus seems reasonable to require the distribution to be drawn from some limited class that depends on the target concept and generates only positive examples, i.e. a class in which the support of the distribution is identical to the positive part of the target concept.</Paragraph> <Paragraph position="5"> Our proposal is that when the class of languages is defined by some simple class of automata, we can consider only those distributions generated by the corresponding stochastic automata. The set of distributions is restricted, and we therefore call this partially distribution-free. Thus, when learning the class of regular languages, which are defined by deterministic finite-state automata (DFAs), we select the class of distributions generated by probabilistic deterministic finite-state automata (PDFAs).</Paragraph>
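To make the proposal concrete for the regular-language case, the following is a minimal illustrative sketch, not taken from the paper: a PDFA represented as a plain Python class (the class and the example automaton are invented for illustration). Each state has a stopping probability and a distribution over outgoing transitions, so the support of the string distribution it defines is exactly the language of the underlying DFA, which is what makes such automata natural generators of positive examples.

```python
import random

class PDFA:
    """Illustrative probabilistic deterministic finite-state automaton.

    transitions: {state: {symbol: (next_state, probability)}}
    stop_prob:   {state: probability of ending the string in that state}
    For each state, stop_prob[q] plus the outgoing probabilities sum to 1.
    """

    def __init__(self, transitions, stop_prob, start_state=0):
        self.transitions = transitions
        self.stop_prob = stop_prob
        self.start_state = start_state

    def sample(self, rng=random):
        """Draw one string from the distribution defined by the automaton."""
        state, symbols = self.start_state, []
        while True:
            r = rng.random()
            if r < self.stop_prob[state]:
                return "".join(symbols)       # stop: emit the string built so far
            r -= self.stop_prob[state]
            for symbol, (next_state, p) in self.transitions[state].items():
                if r < p:                     # follow this transition
                    symbols.append(symbol)
                    state = next_state
                    break
                r -= p

# Example: a PDFA whose support is the regular language (ab)*.
pdfa = PDFA(
    transitions={0: {"a": (1, 0.6)}, 1: {"b": (0, 1.0)}},
    stop_prob={0: 0.4, 1: 0.0},
)
positive_sample = [pdfa.sample() for _ in range(5)]  # e.g. ['', 'ab', 'abab', ...]
```

Every string this automaton produces lies in (ab)*, and every string of (ab)* has non-zero probability, so the support of the distribution coincides with the language, as required of the distributions selected above.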
<Paragraph position="6"> Similarly, context-free languages are normally defined by context-free grammars, which can again be extended to probabilistic or stochastic context-free grammars.</Paragraph> <Paragraph position="7"> Formally, for every class of languages L, defined by some formal device, we define a class of distributions D, defined by a stochastic variant of that device. Then for each language L, we select the set of distributions whose support is equal to the language: D_L = { D ∈ D : supp(D) = L }.</Paragraph> <Paragraph position="9"> Samples are drawn from one of these distributions. There are two technical problems here: first, this does not penalise over-generalisation.</Paragraph> <Paragraph position="10"> Since the distribution is over positive examples, negative examples have zero weight, which would give a hypothesis consisting of all strings zero error. We therefore need some penalty function over negative examples, or alternatively we can require the hypothesis to be a subset of the target and use a one-sided loss function as in Valiant's original paper (Valiant, 1984), which is what we do here; the error of a hypothesis is then simply the probability mass of target strings that it fails to accept. Secondly, this definition is too vague.</Paragraph> <Paragraph position="11"> The exact way in which the &quot;crisp&quot; language is extended to a stochastic one can have serious consequences. When dealing with regular languages, for example, although the class of languages defined by deterministic automata is the same as that defined by non-deterministic automata, the same is not true of their stochastic variants. Additionally, one can have exponential blow-ups in the number of states when determinizing automata. Similarly, for context-free languages, Abney et al. (1999) showed that two parametrisations of models for stochastic context-free languages are equivalent, but that converting between them can blow up in both directions.</Paragraph> <Paragraph position="12"> It is interesting to compare this to the PAC-learning with simple distributions model (Denis, 2001). There, the class of distributions is limited to a single distribution derived from algorithmic complexity theory. There are a number of reasons why this is not appropriate.</Paragraph> <Paragraph position="13"> First, there is a computational issue: since Kolmogorov complexity is not computable, sampling from the distribution is not possible, though a lower bound on the probabilities can be defined. Secondly, there are very large constants in the sample complexity polynomial. Finally, and most importantly, there is no reason to think that real-world samples will be drawn from this distribution; in some sense it is the easiest distribution to learn from, since it dominates every other distribution up to a multiplicative factor.</Paragraph> <Paragraph position="14"> We reject the identification in the limit paradigm introduced by Gold (1967) as unsuitable for three reasons. First, it provides only an asymptotic guarantee that says nothing about the performance of algorithms on finite amounts of data; secondly, the learner must succeed under all presentations of the data, even when these are chosen by an adversary to make learning hard; and thirdly, there are no bounds on the amount of computation allowed.</Paragraph> <Paragraph position="15"> An alternative way to conceive of this problem is to consider the task of learning distributions directly (Kearns et al., 1994), a task related to probability density estimation and language modelling, where the algorithm is given examples drawn from a distribution and must approximate the distribution closely according to some distance measure: usually the Kullback-Leibler divergence or the variational distance. We consider the choice between the distribution-learning analysis and the analysis we present here to depend on what the underlying task or phenomenon to be modelled is. If it is the probability of an event occurring, then the distribution-modelling analysis is better. If, on the other hand, it concerns binary judgments about the membership of strings in some set, then the analysis we present here is preferable.</Paragraph>
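For reference, the two distance measures just mentioned are the standard ones; the definitions below are general background rather than anything specific to this paper. For two distributions D_1 and D_2 over Sigma*:

```latex
\mathrm{KL}(D_1 \parallel D_2) = \sum_{w \in \Sigma^*} D_1(w) \log \frac{D_1(w)}{D_2(w)},
\qquad
d_{\mathrm{var}}(D_1, D_2) = \sum_{w \in \Sigma^*} \left| D_1(w) - D_2(w) \right| .
```

(The variational distance is sometimes defined with an additional factor of 1/2; conventions differ.)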
<Paragraph position="16"> The result of Kearns et al. (1994) reveals a further problem: under a standard cryptographic assumption, the class of acyclic PDFAs over a two-letter alphabet is not learnable, since the class of noisy parity functions can be embedded in this simple subclass of PDFAs.</Paragraph> <Paragraph position="17"> Ron et al. (1995) show that this can be circumvented by adding an additional parameter, the distinguishability, to the sample complexity polynomial; we define this parameter below.</Paragraph> </Section> </Paper>