<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1029">
  <Title>Inducing Probabilistic Syllable Classes Using Multivariate Clustering</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Multivariate Syllable Clustering
</SectionTitle>
    <Paragraph position="0"> EM-based clustering has been derived and applied to syntax (Rooth et al., 1999). Unfortunately, this approach is not applicable to multivariate data with more than two dimensions. However, we consider syllables to consist of at least three dimensions corresponding to parts of the internal syllable structure: onset, nucleus and coda. We have also experimented with 5-dimensional models by adding two more dimensions: position of the syllable in the word and stress status. In our multivariate clustering approach, classes corresponding to syllables are viewed as hidden data in the context of maximum likelihood estimation from incomplete data via the EM algorithm. The two main tasks of EM-based clustering are (i) the induction of a smooth probability model on the data, and (ii) the automatic discovery of class structure in the data. Both aspects are considered in our application. We aim to derive a probability distribution p(y) on syllables y from a large sample. The key idea is to view y as conditioned on an unobserved class c ∈ C, where the classes are given no prior interpretation.</Paragraph>
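    <Paragraph position="1"> The probability of a syllable $y = (y_1, \ldots, y_d)$ is defined as a mixture over classes: $p(y) = \sum_{c \in C} p(c, y) = \sum_{c \in C} p(c)\, p(y \mid c)$.</Paragraph>
    <Paragraph position="3"> Conditioned on the class $c$, the components of a syllable are assumed to be independent: $p(y_1, \ldots, y_d \mid c) = \prod_{i=1}^{d} p(y_i \mid c)$.</Paragraph>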
    <Paragraph position="4"> This assumption makes clustering feasible in the first place; later on (in Section 4.1) we will experimentally determine the number |C| of classes such that the assumption is optimally met. The EM algorithm (Dempster et al., 1977) is directed at maximizing the incomplete data log-likelihood $L = \sum_{y} \tilde{p}(y) \ln p(y)$</Paragraph>
    <Paragraph position="6"> as a function of the probability distribution p for a given empirical probability distribution $\tilde{p}$. Our application is an instance of the EM-algorithm for context-free models (Baum et al., 1970), from which simple re-estimation formulae can be derived. Let f(y) be the frequency of syllable y, and $|f| = \sum_{y} f(y)$ the total frequency of the sample, so that $\tilde{p}(y) = f(y)/|f|$.</Paragraph>
    <Paragraph position="8"/>
    <Paragraph position="10"/>
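    <Paragraph> As a sketch of what such re-estimation formulae look like, assume the mixture model $p(y) = \sum_{c \in C} p(c) \prod_{i=1}^{d} p(y_i \mid c)$, with $f(y)$ the frequency of syllable $y$ and $|f|$ the total frequency of the sample. The symbol $f_c(y)$ for the expected frequency of $y$ in class $c$ is notation introduced here for illustration. Under these assumptions, the standard EM updates for a mixture of this form can be written as:

```latex
% E-step: posterior class membership under the current parameters
\[
p(c \mid y) = \frac{p(c) \prod_{i=1}^{d} p(y_i \mid c)}
                   {\sum_{c' \in C} p(c') \prod_{i=1}^{d} p(y_i \mid c')}
\]

% Expected frequency of syllable y in class c (notation assumed here)
\[
f_c(y) = f(y) \, p(c \mid y)
\]

% M-step: re-estimated class and component probabilities
\[
\hat{p}(c) = \frac{\sum_{y} f_c(y)}{|f|}
\qquad
\hat{p}(y_i \mid c) = \frac{\sum_{y' :\, y'_i = y_i} f_c(y')}{\sum_{y'} f_c(y')}
\]
```

Each update is a relative frequency over expected counts, which is why the re-estimation step stays simple for this model class.</Paragraph>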
    <Paragraph position="12"> As shown by Baum et al. (1970), every such maximization step increases the log-likelihood function L, and a sequence of re-estimates eventually converges to a (local) maximum.</Paragraph>
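    <Paragraph> The monotonic behaviour of the log-likelihood can be checked on a small example. The following is a minimal sketch, not the authors' implementation: it assumes a mixture model of the form p(y) = sum over c of p(c) times the product of p(y_i | c), and the function names (em_step, random_init) and the toy counts are hypothetical.

```python
import math
import random
from collections import defaultdict


def em_step(freq, p_class, p_comp):
    """Run one EM iteration; return (new_p_class, new_p_comp, log_likelihood).

    freq:    dict mapping syllable tuples (onset, nucleus, coda) to counts
    p_class: list of class probabilities p(c)
    p_comp:  p_comp[c][i] is a dict mapping a component value to p(y_i | c)
    The returned log-likelihood is evaluated under the input parameters.
    """
    n_classes = len(p_class)
    n_dims = len(p_comp[0])
    total = sum(freq.values())                      # |f|
    class_mass = [0.0] * n_classes                  # expected count per class
    comp_mass = [[defaultdict(float) for _ in range(n_dims)]
                 for _ in range(n_classes)]
    log_lik = 0.0

    for y, f in freq.items():
        # Joint p(c, y) = p(c) * prod_i p(y_i | c) for every class c.
        joint = [p_class[c] * math.prod(p_comp[c][i][y[i]] for i in range(n_dims))
                 for c in range(n_classes)]
        p_y = sum(joint)
        log_lik += (f / total) * math.log(p_y)
        for c in range(n_classes):
            f_c = f * joint[c] / p_y                # f(y) * p(c | y)
            class_mass[c] += f_c
            for i in range(n_dims):
                comp_mass[c][i][y[i]] += f_c

    new_p_class = [m / total for m in class_mass]
    new_p_comp = [[{v: m / class_mass[c] for v, m in comp_mass[c][i].items()}
                   for i in range(n_dims)] for c in range(n_classes)]
    return new_p_class, new_p_comp, log_lik


def random_init(freq, n_classes, seed=0):
    """Starting point: uniform p(c), randomly perturbed p(y_i | c)."""
    rng = random.Random(seed)
    n_dims = len(next(iter(freq)))
    values = [sorted({y[i] for y in freq}) for i in range(n_dims)]
    p_comp = []
    for _ in range(n_classes):
        dims = []
        for i in range(n_dims):
            weights = {v: 1.0 + rng.random() for v in values[i]}
            norm = sum(weights.values())
            dims.append({v: w / norm for v, w in weights.items()})
        p_comp.append(dims)
    return [1.0 / n_classes] * n_classes, p_comp


# Toy counts, made up for illustration; not taken from the paper's corpora.
toy_freq = {
    ("", "I", "n"): 50, ("", "I", "z"): 40, ("", "I", "t"): 45,
    ("t", "I", "N"): 20, ("l", "I", "N"): 15, ("d", "aI", ""): 10,
    ("s", "aI", ""): 12, ("t", "aI", "m"): 8,
}

p_class, p_comp = random_init(toy_freq, n_classes=2)
history = []
for _ in range(10):
    p_class, p_comp, log_lik = em_step(toy_freq, p_class, p_comp)
    history.append(log_lik)
print(history[0], history[-1])
```

The values collected in history should never decrease from one iteration to the next, in line with the property cited from Baum et al. (1970). Different seeds may lead to different local maxima.</Paragraph>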
  </Section>
  <Section position="4" start_page="0" end_page="55" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> A sample of syllables serves as input to the multivariate clustering algorithm. The German data were extracted from the Stuttgarter Zeitung (STZ), a newspaper corpus of about 31 million words. The English data came from the British National Corpus (BNC), a collection of written and spoken language containing about 100 million words. For both languages, syllables were collected by going through the corpus, looking up the words and their syllabications in a pronunciation dictionary (Baayen et al., 1993) and counting the occurrence frequencies of the syllable types.</Paragraph>
    <Paragraph position="2"> We slightly modified the English pronunciation lexicon to obtain non-empty nuclei, e.g. /idealism/ [aI][dI@][lIzm,] was modified to [aI][dI@][lI][z@m] (SAMPA transcription).</Paragraph>
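    <Paragraph> The frequency collection described above can be sketched as follows. This is an illustrative outline only: the function name count_syllables, the lexicon format (a dict from word to a list of onset, nucleus, coda tuples), and the choice to skip words missing from the lexicon are assumptions, not details given in the text.

```python
from collections import Counter


def count_syllables(tokens, lexicon):
    """Count syllable types over a token stream.

    tokens:  iterable of word forms from the corpus
    lexicon: dict mapping a lower-cased word to its list of syllables,
             each syllable being an (onset, nucleus, coda) tuple
    Words that are not in the lexicon are skipped (assumption of this sketch).
    """
    counts = Counter()
    for word in tokens:
        syllables = lexicon.get(word.lower())
        if syllables is None:
            continue
        counts.update(syllables)
    return counts


# Hypothetical two-entry lexicon in SAMPA-like notation (illustrative only).
toy_lexicon = {
    "it": [("", "I", "t")],
    "idealism": [("", "aI", ""), ("d", "I@", ""), ("l", "I", ""), ("z", "@", "m")],
}
toy_counts = count_syllables(["It", "is", "idealism", "it"], toy_lexicon)
print(toy_counts[("", "I", "t")])  # prints 2
```

The resulting counter plays the role of the frequency table f(y) that serves as input to the clustering.</Paragraph>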
    <Paragraph position="3"> Subsequent experiments on syllable types (Müller et al., 2000) have shown that frequency counts represent valuable information for our clustering task. In two experiments, we induced 3-dimensional models based on syllable onset, nucleus, and coda. We collected 9,327 distinct German syllables and 13,598 distinct English syllables. The number of syllable classes was systematically varied in iterated training runs and ranged from 1 to 200.</Paragraph>
    <Paragraph position="4"> Figure 1 shows a selected segment of class #0 from a 3-dimensional English model with 12 classes. The first column displays the class index 0 and the class probability p(0). The most probable onsets and their probabilities are listed in descending order in the second column, as are nucleus and coda in the third and fourth columns, respectively. Empty onsets and codas were labeled NOP[nucleus].</Paragraph>
    <Paragraph position="5"> Class #0 contains the highly frequent function words in, is, it, its as well as the suffixes -ing, -ting, -ling. Notice that these function words and suffixes appear to be separated in the 5-dimensional model (classes #1 and #3 in Figure 3).</Paragraph>
    <Paragraph position="6"> In two further experiments, we induced 5-dimensional models, augmented by the additional parameters of position of the syllable in the word and stress status. Syllable position has four values: monosyllabic (ONE), initial (INI), medial (MED), and final (FIN). Stress has two values: stressed (STR) and unstressed (USTR). We collected 16,595 distinct German syllables and 24,365 distinct English syllables. The number of syllable classes ranged from 1 to 200. Figure 2 illustrates (part of) class #46 from a 5-dimensional German model with 50 classes. Syllable position and stress are displayed in the last two columns.</Paragraph>
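    <Paragraph> One way to picture the 5-dimensional representation is as a tuple that extends onset, nucleus, and coda with the position and stress labels named above. The sketch below is a hypothetical encoding: the helper names, the use of a single stressed-syllable index, and the stress placement in the example are assumptions made for illustration.

```python
def position_label(index, n_syllables):
    """Map a 0-based syllable index to ONE, INI, MED, or FIN."""
    if n_syllables == 1:
        return "ONE"
    if index == 0:
        return "INI"
    if index == n_syllables - 1:
        return "FIN"
    return "MED"


def five_dim_syllables(syllables, stressed_index):
    """Extend (onset, nucleus, coda) tuples with position and stress.

    stressed_index is the 0-based index of the stressed syllable, or None
    if no syllable is stressed; a single stressed syllable per word is an
    assumption of this sketch.
    """
    n = len(syllables)
    return [
        (onset, nucleus, coda, position_label(i, n),
         "STR" if i == stressed_index else "USTR")
        for i, (onset, nucleus, coda) in enumerate(syllables)
    ]


# "idealism" as [aI][dI@][lI][z@m]; stress on the second syllable is
# assumed here purely for illustration.
word = [("", "aI", ""), ("d", "I@", ""), ("l", "I", ""), ("z", "@", "m")]
for syllable in five_dim_syllables(word, stressed_index=1):
    print(syllable)
```

Under such an encoding, the same onset, nucleus, and coda can yield several distinct 5-dimensional types, which would be consistent with the larger number of distinct syllables reported for the 5-dimensional models.</Paragraph>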
  </Section>
</Paper>