<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1043"> <Title>Learning Stochastic OT Grammars: A Bayesian approach using Data Augmentation and Gibbs Sampling</Title> <Section position="2" start_page="0" end_page="346" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Optimality Theory (Prince and Smolensky, 1993) is a linguistic theory that dominates the field of phonology and some areas of morphology and syntax. The standard version of OT makes the following assumptions: * A grammar is a set of ordered constraints {Ci}, where each Ci is a function from S∗, the set of strings in the language, to {0, 1, ···};</Paragraph> <Paragraph position="2"> ∗The author thanks Bruce Hayes, Ed Stabler, Yingnian Wu, Colin Wilson, and anonymous reviewers for their comments. * Each underlying form u corresponds to a set of candidates GEN(u). To obtain the unique surface form, the candidate set is successively filtered according to the order of the constraints, so that only the most harmonic candidates remain after each filtering step. If only one candidate is left in the candidate set, it is chosen as the optimal output.</Paragraph> <Paragraph position="3"> The popularity of OT is partly due to learning algorithms that induce constraint rankings from data. However, most such algorithms cannot be applied to noisy learning data. Stochastic Optimality Theory (Boersma, 1997) is a variant of Optimality Theory that aims to predict linguistic variation quantitatively. As a popular model among linguists who are more engaged with empirical data than with formalisms, Stochastic OT has been used in a large body of linguistics literature.</Paragraph> <Paragraph position="4"> In Stochastic OT, constraints are regarded as independent normal distributions with unknown means and fixed variance. As a result, the stochastic constraint hierarchy generates systematic linguistic variation. For example, consider a grammar with three constraints C1, C2, and C3. The probabilities p(·)
are obtained by repeatedly sampling the three normal distributions, generating the winning candidate according to the resulting ordering of constraints, and counting the relative frequencies of the outcomes. As a result, the grammar will assign non-zero probabilities to a given set of outputs, as shown above.</Paragraph> <Paragraph position="5"> The learning problem of Stochastic OT involves fitting a grammar G ∈ R^N to a set of candidates with frequency counts in a corpus. For example, if the learning data is the above table, we need to find an estimate of G = (u1, u2, u3) so that the following ordering relations hold with certain probabilities: max{C1, C2} > C3 with probability .77; max{C1, C2} < C3 with probability .23. (1) The current method for fitting Stochastic OT models, used by many linguists, is the Gradual Learning Algorithm (GLA) (Boersma and Hayes, 2001). GLA searches for the correct ranking values using the following heuristic, which resembles gradient descent: first, an input-output pair is sampled from the data; second, an ordering of the constraints is sampled from the grammar and used to generate an output; finally, the means of the constraints are updated so as to reduce the error. The update is done by adding or subtracting a &quot;plasticity&quot; value that goes to zero over time. The intuition behind GLA is that it performs &quot;frequency matching&quot;, i.e. it seeks a better match between the output frequencies of the grammar and those in the data.</Paragraph> <Paragraph position="6"> As it turns out, GLA does not work in all cases, and its lack of formal foundations has been questioned by a number of researchers (Keller and Asudeh, 2002; Goldwater and Johnson, 2003).</Paragraph> <Paragraph position="7"> However, considering the broad range of linguistic data that has been analyzed with Stochastic OT, it seems inadvisable to reject this model because of the absence of theoretically sound learning methods.
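The generation process described above — sampling each constraint's ranking value from its normal distribution and comparing the results — can be illustrated with a short Monte Carlo sketch. This is not the paper's implementation; the function name, the noise standard deviation, and the trial count are illustrative assumptions, and the sketch estimates the kind of ordering probability that appears in (1):

```python
import random

def prob_max12_beats_3(u1, u2, u3, sd=2.0, trials=50000, seed=0):
    """Monte Carlo estimate of P(max{C1, C2} > C3), where each constraint's
    ranking value is its mean plus independent Gaussian evaluation noise
    with a shared, fixed standard deviation (illustrative value here)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        c1 = rng.gauss(u1, sd)
        c2 = rng.gauss(u2, sd)
        c3 = rng.gauss(u3, sd)
        if max(c1, c2) > c3:
            wins += 1
    return wins / trials
```

With equal means the estimate approaches 2/3, since each of the three independent samples is equally likely to be the largest; raising u3 relative to u1 and u2 drives the probability toward zero. Fitting a grammar amounts to choosing means so that such probabilities match observed frequencies like the .77/.23 split in (1).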
Rather, a general solution is needed to evaluate and learn such grammars. In the present work, the learning problem is formalized as finding the posterior distribution of ranking values (G) given the information on constraint interaction carried by input-output pairs (D). The posterior contains all the information needed for linguists' use: for example, if there is a grammar that generates exactly the frequencies found in the data, it will appear as a mode of the posterior.</Paragraph> <Paragraph position="8"> In computation, the posterior distribution is simulated with MCMC methods, because the likelihood function has a complex form that makes a maximum-likelihood approach hard to carry out.</Paragraph> <Paragraph position="9"> Such problems are avoided by using the Data Augmentation algorithm (Tanner and Wong, 1987) to make computation feasible: to simulate the posterior distribution G ∼ p(G|D), we augment the parameter space and simulate a joint distribution (G, Y) ∼ p(G, Y|D). It turns out that by setting Y to the values of the constraints that obey the desired ordering, simulation from p(G, Y|D) can be achieved with a Gibbs sampler, which constructs a Markov chain converging to the joint posterior distribution (Geman and Geman, 1984; Gelfand and Smith, 1990). I will also discuss some issues related to efficiency of implementation.</Paragraph> </Section> </Paper>