<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1053">
  <Title>Automatic Detection of Syllable Boundaries Combining the Advantages of Treebank and Bracketed Corpora Training</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Treebank Training (TT) and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Bracketed Corpora Training (BCT)
</SectionTitle>
      <Paragraph position="0"> Treebank grammars are context-free grammars (CFG) that are directly read from production rules of a hand-parsed treebank. The probability of each rule is assigned by observing how often each rule was used in the training corpus, yielding a probabilistic context-free grammar. In syntax it is a commonly used method, e.g. Charniak (1996) extracted a treebank grammar from the Penn Wall Street Journal. The advantages of treebank training are the simple procedure, and the good results which are due to the fact that for each word that appears in the training corpus there is only one possible analysis. The disadvantage is that grammars which are read off a treebank are dependent on the quality of the treebank. There is no freedom of putting more information into the grammar. null Bracketed Corpora Training introduced by Pereira and Schabes (1992) employs a context-free grammar and a training corpus, which is partially tagged with brackets. The probability of a rule is inferred by an iterative training procedure with an extended version of the inside-outside algorithm. However, only those analyses are considered that meet the tagged brackets (here syllable brackets). Usually the context-free grammars generate more than one analysis. BCT reduces the large number of analyses. We utilize a special case of BCT where the number of analyses is</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Combining the Advantages of TT and
BCT
</SectionTitle>
    <Paragraph position="0"> Our method used for the experiments is based on treebank training as well as bracketed corpora training. The main idea is that there are large pronunciation dictionaries that provide information about how words are transcribed and how they are syllabified. We want to exploit this linguistic knowledge that was put into these dictionaries. For our experiments we employ a pronunciation dictionary, CELEX (Baayen et al. (1993)) that provides syllable boundaries, our so-called treebank. We use the syllable boundaries as brackets. The advantage of BCT can be utilized: writing grammars using linguistic knowledge. With our method a special case of BCT is applied where the brackets in combination with a manually constructed grammar guarantee a single analysis in the training step with maximal linguistic information.</Paragraph>
    <Paragraph position="1"> Figure 2 depicts our new algorithm. We manually construct different linguistically motivated context-free grammars with brackets marking the syllable boundaries. We start with a simple grammar and continue to add more linguistic information to the advanced grammars. The input of the grammars is a bracketed corpus that was extracted from the pronunciation dictionary CELEX. In a treebank training step we obtain a probabilistic context-free grammar (PCFG) by observing how often each rule was used in the training corpus.</Paragraph>
    <Paragraph position="2"> The brackets of the input guarantee an unambigous analysis of each word. Thus, we can apply  the formula of treebank training given by (Char(1.1) 0.1774 Word![ Syl ] (1.2) 0.5107 Word![ Syl ][Syl ] (1.3) 0.1997 Word![ Syl ][Syl ][Syl ] (1.4) 0.4915 Syl!Onset Nucleus Coda (1.5) 0.3591 Syl!Onset Nucleus (1.6) 0.0716 Syl!Nucleus Coda (1.7) 0.0776 Syl!Nucleus (1.8) 0.9045 Onset!On (1.9) 0.0918 Onset!On On (1.10) 0.0036 Onset!On On On</Paragraph>
    <Paragraph position="4"> niak, 1996): if r is a rule, let jrj be the number of times r occurred in the parsed corpus and (r) be the non-terminal that r expands, then the probability assigned to r is given by</Paragraph>
    <Paragraph position="6"> We then transform the PCFG by dropping the brackets in the rules resulting in an analysis grammar. The bracketless analysis grammar is used for parsing the input without brackets; i.e., the phoneme strings are parsed and the syllable boundaries are extracted from the most probable parse. We want to exemplify our method by means of a syllable structure grammar and an exemplary phoneme string.</Paragraph>
    <Paragraph position="7"> Grammar. We experimented with a series of grammars, which are described in details in section 4.2. In the following we will exemplify how the algorithm works. We chose the syllable structure grammar, which divides a syllable into onset, nucleus and coda. The nucleus is obligatory which can be either a vowel or a diphtong. All phonemes of a syllable that are on the left-hand side of the nucleus belong to the onset and the phonemes on the right-hand side pertain to the coda. The onset or the coda may be empty. The context-free grammar fragment in Figure 3 describes a so called training grammar with brackets. null We use the input word &amp;quot;Forderung&amp;quot; (claim) [fOR][d@][RUN] in the training step. The unambiguous analysis of the input word with the syllable structure grammar is shown in Figure 1.</Paragraph>
    <Paragraph position="8"> Training. In the next step we train the context-free training grammar. Every grammar rule appearing in the grammar obtains a probability depending on the frequency of appearance in the training corpus, yielding a PCFG. A fragment  of the syllable structure grammar is shown in Figure 3 (with the recieved probabilities).</Paragraph>
    <Paragraph position="9"> Rules (1.1)-(1.3) show that German disyllabic words are more probable than monosyllabic and trisyllabic words in the training corpus of 389000 words. If we look at the syllable structure, then it is more common that a syllable consists of an onset, nucleus, and coda than a syllable comprising the onset and nucleus; the least probable structure are syllables with an empty onset, and syllables with empty onset and empty coda. Rules (1.8)-(1.10) show that simple onsets are preferred over complex ones, which is also true for codas.</Paragraph>
    <Paragraph position="10"> Furthermore, the voiced stop [d] is more likely to appear in the onset than the voiceless fricative [f]. Rules (1.19)-(1.20) show the Coda consonants with descending probability: [R], [N].</Paragraph>
    <Paragraph position="11"> Grammar transformation. In a further step we transform the obtained PCFG by dropping all syllable boundaries (brackets). Rules (1.4)-(1.20) do not change in the fragment of the syllable structure grammar. However, the rules (1.1)-(1.3) of the analysis grammar are  affected by the transformation, e.g. the rule (1.2.) Word ![ Syl ][Syl ] would be transformed to (1.2.') Word ! Syl Syl, dropping the brackets Predicting syllable boundaries. Our system is now able to predict syllable boundaries with the transformed PCFG and a parser. The input of the system is a phoneme string without brackets. The  phoneme string [fORd@RUN] (claim) gets the following possible syllabifications according to the syllable structure grammar: [fO][Rd@R][UN], [fO][Rd@][RUN], [fOR][d@R][UN], [fOR][d@][RUN], [fORd][@R][UN] and [fORd][@][RUN] .</Paragraph>
    <Paragraph position="12"> The final step is to choose the most probable analysis. The subsequent tree depicts the most probable analysis: [fOR][d@][RUN], which is also the correct analysis with the overall word probability of 0.5114. The probability of one  The grammar was trained on 389000 words analysis is defined as the product of the probabilities of the grammar rules appearing in the analysis normalized by the sum of all analysis probabilities of the given word. The category &amp;quot;Syl&amp;quot; shows which phonemes belong to the syllable, it indicates the beginning and the end of a syllable. The syllable boundaries can be read off the tree: [fOR][d@][RUN].</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We experimented with a series of grammars: the first grammar, a treebank grammar, was automatically read from the corpus, which describes a syllable consisting of a phoneme sequence. There are no intermediate levels between the syllable and the phonemes. The second grammar is a phoneme grammar where only the number of phonemes is important. The third grammar is a consonant-vowel grammar with the linguistic information that there are consonants and vowels.</Paragraph>
    <Paragraph position="1"> The fourth grammar, a syllable structure grammar is enriched with the information that the consonant in the onset and coda are subject to certain restrictions. The last grammar is a positional syllable structure grammar which expresses that the consonants of the onset and coda are restricted according to the position inside of a word (e.g, initial, medial, final or monosyllabic). These grammars were trained on different sizes of corpora and then evaluated. In the following we first introduce the training procedure and then describe the grammars in details. In section 5 the evaluation of the system is described.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Training procedure
</SectionTitle>
      <Paragraph position="0"> We use a part of a German newspaper corpus, the Stuttgarter Zeitung, consisting of 3 million words which are divided into 9/10 training and 1/10 test corpus. In a first step, we look up the words and their syllabification in a pronunciation dictionary.</Paragraph>
      <Paragraph position="1"> The words not appearing in the dictionary are discarded. Furthermore we want to examine the influence of the size of the training corpus on the results of the evaluation. Therefore, we split the training corpus into 9 corpora, where the size of the corpora increases logarithmically from 4500 to 2.1 million words. These samples of words serve as input to the training procedure.</Paragraph>
      <Paragraph position="2"> In a treebank training step we observe for each rule in the training grammar how often it is used for the training corpus. The grammar rules with their probabilities are transformed into the analysis grammar by discarding the syllable boundaries. The grammar is then used for predicting syllable boundaries in the test corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Description of grammars
</SectionTitle>
      <Paragraph position="0"> Treebank grammar. We started with an automatically generated treebank grammar. The grammar rules were read from a lexicon. The number of lexical entries ranged from 250 items to 64000 items. The grammars obtained start with 460 rules for the smallest training corpus, increasing to 6300 rules for the largest training corpus. The grammar describes that words are composed of syllables which consist of a string of phonemes or a single phoneme. The following table shows the frequencies of some of the rules of the analysis grammar that are required to analyze the word [fORd@RUN] (claim):  (3.1) 0.1329 Word ! Syl Syl Syl (3.2) 0.0012 Syl ! fOR (3.3) 0.0075 Syl ! d@ (3.4) 0.0050 Syl ! d@R (3.5) 0.0020 Syl ! RUN (3.6) 0.0002 Syl ! UN  Rule (3.1) describes a word that branches to three syllables. The rules (3.2)-(3.6) depict that the syllables comprise different phoneme strings. For example, the word &amp;quot;Forderung&amp;quot; (claim) can result in the following two analyses:  The right tree receives the overall probability of (0.0846) and the left tree (0.9153), which means that the word [fORd@RUN] would be syllabified: [fOR][d@][RUN] (which is the correct analysis). Phoneme grammar. A second grammar is automatically generated where an abstract level is introduced. Every input phoneme is tagged with the phoneme label: P. A syllable consists of a phoneme sequence, which means that the number of phonemes and syllables is the decisive factor for calculating the probability of a word segmentation (into syllables). The following table shows a fragment of the analysis grammar with the rule frequencies. The grammar consists of 33 rules.</Paragraph>
      <Paragraph position="2"> The second and third rule describe that a three-phonemic syllable is preferred over twophonemic syllables. Rules (4.4)-(4.6) show that P is re-written by the phonemes: [f], [O], and [R].</Paragraph>
      <Paragraph position="3"> The word &amp;quot;Forderung&amp;quot; can be analyzed with the training grammar as follows (two examples out of 4375 possible analyses):</Paragraph>
      <Paragraph position="5"> Consonant-vowel grammar. In comparison with the phoneme grammar, the consonant-vowel (CV) grammar describes a syllable as a consonant-vowel-consonant (CVC) sequence (Clements and Keyser, 1983). The linguistic knowledge that a syllable must contain a vowel is added to the CV grammar, which consists of 31 rules.</Paragraph>
      <Paragraph position="7"> Rule (5.1) shows that a three-syllabic word is more likely to appear than a mono-syllabic word (rule (5.2)). A CVC sequence is more probable than an open CV syllable. The rules (5.5)-(5.8) depict some consonants and vowels and their probability. The word &amp;quot;Forderung&amp;quot; can be analyzed as follows (two examples out of seven possible analyses):  The correct analysis (left tree) is more probable than the wrong one (right tree).</Paragraph>
      <Paragraph position="8"> Syllable structure grammar. We added to the CV grammar the information that there is an onset, a nucleus and a coda. This means that the consonants in the onset and in the coda are assigned different weights. The grammar comprises 1025 rules. The grammar and an example tree was already introduced in section 3.</Paragraph>
      <Paragraph position="9"> Positional syllable structure grammar. Further linguistic knowledge is added to the syllable structure grammar. The grammar differentiate between monosyllabic words, syllables that occur in inital, medial, and final position. Furthermore the syllable structure is defined recursively.</Paragraph>
      <Paragraph position="10"> Another difference to the simpler grammar versions is that the syllable is devided into onset and rhyme. It is common wisdom that there are restrictions inside the onset and the coda, which are the topic of phonotactics. These restrictions are language specific; e.g., the phoneme sequence [ld] is quite frequent in English codas but it never appears in English onsets. Thus the feature position of the phonemes in the onset and in the coda is coded in the grammar, that means for example that an onset cluster consisting of 3 phonemes are ordered by their position inside of the cluster, and their position inside of the word, e.g. On.ini.1 (first onset consonant in an initial syllable), On.ini.2, On.ini.3. A fragment of the analysis grammar is shown in the following table:  Rule (6.1) shows a monosyllabic word consisting of one syllable. The second and third rules describe a bisyllabic word comprising an initial and a final syllable. The monosyllabic feature &amp;quot;one&amp;quot; is inherited to the daughter nodes, here to the onset, nucleus and coda in rule (6.4). Rule (6.5) depicts an onset that branches into two onset parts in a monosyllabic word. The numbers represents the position inside the onset. The subsequent rule displays the phoneme [f] of an initial onset. In rule (6.7) the nucleus of an initial syllable consists of the phoneme [O]. Rule (6.8) means that the initial coda only comprises one consonant, which is re-written by rule (6.9) to a mono-phonemic coda which consists of the phoneme [R]. The first of the following two trees recieves a higher overall probabability than the second one. The correct analysis of the transcribed word /claim/ [fORd@RUN] can be extracted from the most probable tree: [fOR][d@][RUN]. Note, all other analyses of [fORd@RUN] are very unlikely to occur.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>