<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0608">
  <Title>Probabilistic Context-Free Grammars for Phonology</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> First, we evaluate the grammar on a syllabification task for German. Second, we linguistically analyze the errors made by the German phonological parser on the evaluation corpus (word types). Third, and more importantly, we concentrate on a qualitative evaluation of syllable structure for German and English.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Evaluation on Syllabification
</SectionTitle>
      <Paragraph position="0"> The resulting probabilistic phonological grammar of German (Section 3) is evaluated on a syllabification task by comparing the maximum-probability parses of all raw phoneme strings in the test corpus (242,047 word tokens, 24,735 word types) with their annotated bracketed variants. As the evaluation measure, we use "word accuracy", the proportion of words for which all predicted syllable brackets exactly match the annotated syllable brackets. Our phonological grammar for German achieves 96.88% word accuracy on word tokens and 90.33% on word types.</Paragraph>
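      <Paragraph> The word accuracy measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the paper: the function name word_accuracy, the weights parameter, and the toy bracketings are all assumptions made for the example. Counting every word once gives accuracy on word types; weighting each word by its corpus frequency gives accuracy on word tokens.

```python
def word_accuracy(predicted, gold, weights=None):
    """Share of words whose predicted bracketing equals the gold bracketing.

    weights is a hypothetical parameter holding per-word token frequencies.
    Without weights every word counts once (type accuracy); with
    frequencies as weights the result is token accuracy.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if weights is None:
        weights = [1] * len(gold)
    if len(weights) != len(gold):
        raise ValueError("weights and gold must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("no words to evaluate")
    # A word counts as correct only if every syllable bracket matches exactly.
    correct = sum(w for p, g, w in zip(predicted, gold, weights) if p == g)
    return correct / total


# Toy data invented for illustration (not taken from the evaluation corpus).
gold = ["[ge:][b@n]", "[kIn][d6]", "[aI]"]
pred = ["[ge:][b@n]", "[kI][nd6]", "[aI]"]
freq = [5, 1, 4]

type_acc = word_accuracy(pred, gold)         # 2 of 3 word types correct
token_acc = word_accuracy(pred, gold, freq)  # 9 of 10 word tokens correct
print(type_acc, token_acc)
```

      Because frequent words dominate the token count, a parser that handles common words well scores higher on tokens than on types, which matches the direction of the gap reported above.</Paragraph>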
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Error Analysis
</SectionTitle>
      <Paragraph position="0"> We analyze the results of the German phonological parser on the evaluation corpus consisting of word types. Out of 24,735 word types, 2,391 contained wrongly predicted syllable boundaries. We analyze every tenth incorrect word, which gives 239 items. These items contain 243 errors; because a single word can contain more than one error, the number of errors is higher than the number of items. We find that 72.42% of the errors involve consonants incorrectly assigned to the onset, whereas only 27.57% involve consonants wrongly assigned to the coda. The tendency for more errors to occur in predicting the onset agrees with the findings of Müller (2001b). The main onset errors appear when a [t], [R], or [n] is predicted to be an onset consonant. Errors in predicting coda consonants mainly occur with [k], [s], and [p].</Paragraph>
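      <Paragraph> The counts in this paragraph can be cross-checked with a short calculation. The sketch below is only an arithmetic illustration: the function and constant names are hypothetical, and taking every tenth word is modeled as integer division, which is an assumption about how the sample was drawn.

```python
def type_accuracy(total_types, wrong_types):
    """Word accuracy on types, derived from the number of wrong word types."""
    return (total_types - wrong_types) / total_types


TOTAL_TYPES = 24735      # word types in the evaluation corpus (from the text)
WRONG_TYPES = 2391       # word types with a wrong boundary (from the text)
ERRORS_IN_SAMPLE = 243   # errors found in the analyzed sample (from the text)

accuracy_percent = round(100 * type_accuracy(TOTAL_TYPES, WRONG_TYPES), 2)
sample_size = WRONG_TYPES // 10            # every tenth incorrect word
errors_per_item = ERRORS_IN_SAMPLE / sample_size

print(accuracy_percent)   # 90.33
print(sample_size)        # 239
print(round(errors_per_item, 3))
```

      The first two values agree with the 90.33% word accuracy on word types and the 239 analyzed items reported in the text.</Paragraph>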
      <Paragraph position="1"> If we investigate the errors on the linguistic level, the most frequent error occurs with word boundaries (98 cases). A further error source is syllable boundaries occurring in conjunction with prefixes and suffixes. The most frequent prefix errors appear at syllable boundaries after prefixes like /ver-/, /er-/, and /un-/, whereas the most frequent suffix error appears with /-lich/.</Paragraph>
      <Paragraph position="2"> Foreign words, which are subject to different phonotactic constraints, seem to be a minor source of errors (20 cases). Thus, most of the errors are found in conjunction with morphological entities. This suggests that a further morphological level could help to disambiguate syllabification alternatives.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Qualitative Evaluation
</SectionTitle>
      <Paragraph position="0"> The evaluation is carried out for both English and German. First, we compare the complexity of words, syllables, and syllable parts. Second, we analyze the probabilities of grammar rules involving phonemic terminal nodes. Unfortunately, due to space constraints and the large size of our derived probabilistic phonological grammars, only preliminary results can be presented.</Paragraph>
      <Paragraph position="1"> Table 1 displays the occurrence frequencies of onsets and codas (of different size and different syllable position), counted on the basis of the training corpora for English and German (see Section 3). The following three complexity analyses are carried out on the basis of this table.</Paragraph>
      <Paragraph position="2"> Word complexity. German words tend to be more complex than English words. Monosyllabic words make up 48.7% of the German training corpus, compared with 67.4% of the English training corpus. The high frequency of monosyllabic words justifies treating them separately.</Paragraph>
      <Paragraph position="3"> Syllable complexity. German syllables usually consist of onset and rhyme. An onset is observed in initial syllables (73%), in medial syllables (94%), in final syllables (94%), and in monosyllabic words (73%). A coda is found in initial syllables (42%), in medial syllables (34%), in final syllables (78%), and in monosyllabic words (83%).</Paragraph>
      <Paragraph position="4"> English syllables. An onset is observed in initial syllables (72.2%), in medial syllables (94%), in final syllables (95%), and in monosyllabic words (68.3%). A coda is found in initial syllables (29.7%), in medial syllables (21.2%), in final syllables (82.3%), and in monosyllabic words (69%).</Paragraph>
      <Paragraph position="5"> Onset and coda complexity. English and German syllables prefer simple onsets and codas.</Paragraph>
      <Paragraph position="6"> English onsets. For both initial and medial syllables, a single consonant is found (80%), two consonants (18%), and three consonants (less than 1%). For both final syllables and monosyllabic words, one consonant is observed (85%), two consonants (13%), and three consonants (less than 0.9%).</Paragraph>
      <Paragraph position="7"> German onsets. For both initial and medial syllables, one consonant is found (82%), two consonants (15%), and three consonants (1%). For both monosyllabic words and final syllables, a single consonant is found (90%), two consonants (7%), and three consonants (less than 1%).</Paragraph>
      <Paragraph position="8"> English codas. For both initial and medial syllables, one consonant is observed (95%), and two consonants (5%). For both final syllables and monosyllabic words, one consonant is found (77%), two consonants (20%), and three consonants (2%).</Paragraph>
      <Paragraph position="9"> German codas. For both initial and medial syllables, one consonant is observed (88%), two consonants (10%), and three consonants (about 1%). In final syllables, one consonant is found (82.8%), two consonants (15.2%), three consonants (1.7%), and four consonants (0.08%). In monosyllabic words, one consonant occurs (74.6%), two consonants (23.1%), three consonants (1.8%), and four consonants (0.3%).</Paragraph>
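      <Paragraph> Distributions like the ones above come from counting cluster sizes separately for each syllable position. The following sketch shows one way such counts could be collected, assuming each word is already given as a list of (onset, nucleus, coda) triples; the helper names and the toy words are hypothetical and are not drawn from the training corpora.

```python
from collections import Counter


def position_of(index, n_syllables):
    """Map a syllable index to one of the four positions used in the text."""
    if n_syllables == 1:
        return "mono"
    if index == 0:
        return "initial"
    if index == n_syllables - 1:
        return "final"
    return "medial"


def onset_size_distribution(words):
    """Count onset sizes (number of consonants) per syllable position."""
    counts = {}
    for syllables in words:
        for i, (onset, _nucleus, _coda) in enumerate(syllables):
            pos = position_of(i, len(syllables))
            counts.setdefault(pos, Counter())[len(onset)] += 1
    return counts


# Toy words invented for illustration; each syllable is (onset, nucleus, coda).
words = [
    [(["k"], "I", ["n"]), (["d"], "6", [])],
    [(["S", "t", "R"], "O", ["m"])],
    [(["b"], "a", []), (["n"], "a:", []), (["n"], "@", [])],
]

dist = onset_size_distribution(words)
for pos, sizes in dist.items():
    total = sum(sizes.values())
    shares = {size: count / total for size, count in sizes.items()}
    print(pos, shares)
```

      The same loop applied to the coda instead of the onset would give the coda distributions; dividing each count by the position total yields percentages of the kind reported here.</Paragraph>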
      <Paragraph position="10"> Nuclei. Table 2 displays the nuclei found for English and German. The symbol "-" indicates that the phoneme has not been observed in the training corpus. However, the corresponding grammar rules receive a (very small) positive probability according to our smoothing procedure. Note that we do not display phonemes that are marked with "-" for all positions. Moreover, the symbol "a51a53a52a54a23a55a52a56a52a57a38" indicates a probability of less than 0.001 (which means an occurrence frequency of less than 0.1%).</Paragraph>
      <Paragraph position="11"> English nuclei. The most likely nuclei in initial syllables are [@, I, E, O, &amp;, V] (16.7%, 16.6%, 12.6%, 7.2%, 7%, 6.9%), in medial syllables [I, @, E, eI] (28.2%, 26.8%, 10.5%, 7.7%), in final syllables [@, I] (38.6%, 36.7%), and in monosyllabic words [@, I, &amp;, O, eI, u:] (22%, 17.1%, 9.5%, 8.2%, 6.9%, 6.4%). Furthermore, long vowels/diphthongs account for 33.6% of the nuclei in monosyllabic words, 29.1% in initial syllables, 23.6% in medial syllables, and 18% in final syllables.</Paragraph>
      <Paragraph position="12"> German nuclei. In initial syllables, the most probable nuclei are [E, a, aI, @, e:, a:, I, o:, i:, O] (13.%, 10.7%, 10.5%, 8.2%, 7.5%, 6.6%, 6.4%, 6.1%, 5.4%, 5.1%), in medial syllables [@, i:, I, a, E, a:, e:, o:] (18%, 13.5%, 12.9%, 9.2%, 9.1%, 5.6%, 5.4%, 5.2%), in final syllables [@, I, U, a] (69.3%, 4.9%, 4.4%, 3.6%), and in monosyllabic words [e:, I, i:, a, U, E, aI, O, aU] (17.1%, 16.3%, 12.4%, 10.3%, 8.2%, 6.1%, 6%, 5.2%, 5.1%). Generally, we can observe that long vowels/diphthongs are more likely in monosyllabic words (52.6%) than in all other syllable positions (48.3% in initial syllables, 42% in medial syllables, and 13.4% in final syllables).</Paragraph>
      <Paragraph position="13"> Mono-consonantal onsets and codas. Table 3 and Table 4 display the onsets and codas consisting of 1 consonant.</Paragraph>
      <Paragraph position="14"> German onsets. The most probable consonants in initial syllables are [f, v, g, b, z, m, d, k, h] (11.5%, 11.5%, 9.8%, 9.6%, 7.8%, 7.8%, 7.3%, 6.4%, 6.2%), in medial syllables [l, t, g, n, R, d, z] (12.1%, 11.6%, 10.3%, 7.8%, 7.4%, 7.2%, 6.9%), in final syllables [t, n, d, R, l] (18.4%, 12%, 10.6%, 8.2%, 7.6%), and in monosyllabic words [d, z, f, n, m] (45.4%, 10.7%, 9.9%, 7%).</Paragraph>
      <Paragraph position="15"> English onsets. In initial syllables, the most probable consonants are [s, k, r, m, d, p] (11.4%, 11.3%, 9.8%, 8.7%, 8.4%, 8%), in medial syllables [t, s, l, n, r, v] (12.5%, 9.8%, 9.7%, 9.1%, 8%, 8%), in final syllables [t, l, d, S, s, r] (18.5%, 9.5%, 7.5%, 7.3%, 6.7%, 6.5%), and in monosyllabic words [D, t, w, b, h] (25%, 10%, 10%, 6%, 6%).</Paragraph>
      <Paragraph position="16"> German codas. In initial position, the most likely consonants are [R, n] (35.6%, 26.9%), in medial syllables [R, n] (31.5%, 28.7%), in final syllables [n, R] (50.1%, 18.3%), and in monosyllabic words [n, R, s, x] (27.3%, 27.1%, 15.6%, 9.6%).</Paragraph>
      <Paragraph position="17"> English codas. In initial syllables, the most dominant consonants are [n, k, m] (43.5%, 11.8%, das). Due to space constraints, we do not display clusters of 4-5 consonants, but our analysis can be found elsewhere (Müller, to appear 2002). Clusters with more than 5 consonants have not been found in our corpora. Furthermore, no onsets comprising 4 consonants occur for German, and no codas comprising 5 consonants occur for English. Last, for German, there is only one consonant cluster, [Rnsts], which appears in words like "Ernsts", the genitive case of the proper name "Ernst".</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Like Müller (2001a), we presented a method for predicting syllable boundaries using phonological probabilistic context-free grammars for German. However, our approach performs slightly better (96.88% word accuracy on word tokens versus 96.49%). Van den Bosch (1997) reports a word error rate of 2.22% on English syllabification using inductive learning. Due to the feature "cluster size", which was not used by Müller (2001a), we are able to give an extensive qualitative evaluation of syllable structure that considers syllable positions, the complexities of consonant clusters, and the position of a consonant within a cluster. Since our approach is multilingual (only training is language dependent), we evaluated two languages (German and English), showing that probabilistic context-free grammars add linguistic knowledge to phonology. In contrast to theoretical approaches, we focus on syllable structures that are preferred in a certain language. Theoretical phonotactic approaches describe possible syllable structures: Hall (1992), Wiese (1996), and Féry (1995) for German, and Kenstowicz (1994) and Morelli (1999) for English. There are many more approaches dealing with syllable structure. For instance, Kiraz and Möbius (1998) develop multilingual syllabification models on the basis of a pronunciation dictionary. Partial English syllable structure is described by Pierrehumbert (1994), who also used a dictionary. A more general model was introduced by Müller et al. (2000), who used a clustering algorithm to induce English and German syllable classes. However, that approach treats the onsets and codas as one string. In our method, we describe the internal structure of onsets and codas in more detail. Our model has several advantages. (i) We believe that the syllable structure of all words occurring in a certain language can be described by an elaborated context-free grammar.
(ii) Moreover, using probabilistic context-free grammars, alternative syllabifications of phoneme strings can be disambiguated. (iii) Our model is able to analyze unexpected phoneme strings. For instance, the onset [bRd] of the proper name "Brdaric" ([bRda:RItS]) is not allowed according to German phonotactics. Table 7 correctly shows that [b] does not occur as a first, [R] as a second, or [d] as a third consonant in initial triconsonantal German onset clusters.</Paragraph>
    <Paragraph position="1"> Due to the smoothing procedure, our model syllabifies this name as [bRda:][RItS], although the onset [bRd] has never been observed in the German training corpus. (iv) The syllable structure of nonsense words can be predicted. The model can be exploited in two ways: first, it predicts the most probable syllabification; second, it can be used to model the lexical decision task. An English example is given by Pierrehumbert (1994). She compared "bistro" ([bIstr@U]), a possible word, with a good word "bimplo" ([bImpl@U]) and a bad word "bilflo" ([bIlfl@U]). The four possible syllabifications of "bimplo" are [bIm][pl@U], [bI][mpl@U], [bImp][l@U], and [bImpl][@U]. The most probable syllable structure for "bimplo" is [bIm][pl@U] (1.8403e-13). For the disyllabic word "bilflo", the model assigns the highest probability to a syllable boundary between [l] and [fl]. Out of the four possible syllabifications for the real word "bistro", [bIstr][@U] is the most probable syllable structure. Although the triconsonantal cluster should be an onset cluster rather than a coda cluster, the model prefers [str] as a single cluster, whereas a syllable boundary is added between [m] and [pl], and between [l] and [fl]. A further example mentioned in the literature is [brIk], [blIk], and [bnIk]. The first, [brIk], is a possible word of English and receives a probability of 1.16e-08; the second, [blIk], is a non-occurring word (7.2391e-09); and the third, [bnIk], is a non-English word (4.3249e-09). The least probable one is the non-English word, followed by the non-occurring one, and the highest probability is assigned to the real word "brick".</Paragraph>
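    <Paragraph> The candidate syllabifications of a two-syllable word and the ranking of word candidates by probability can be illustrated with a small sketch. It does not implement the grammar or the parser: the function candidate_syllabifications is a hypothetical helper that only enumerates boundary positions inside the intervocalic cluster, and the probabilities are copied from the text rather than computed.

```python
def candidate_syllabifications(first_part, cluster, last_part):
    """List every way to place one boundary in an intervocalic cluster.

    first_part: phonemes up to and including the first nucleus
    cluster:    list of consonants between the two nuclei
    last_part:  the second nucleus and everything after it
    """
    return [
        "[" + first_part + "".join(cluster[:i]) + "]"
        + "[" + "".join(cluster[i:]) + last_part + "]"
        for i in range(len(cluster) + 1)
    ]


# "bimplo": three intervocalic consonants give four boundary positions.
candidates = candidate_syllabifications("bI", ["m", "p", "l"], "@U")
print(candidates)

# Probabilities reported in the text for the three monosyllabic candidates.
scores = {"brIk": 1.16e-08, "blIk": 7.2391e-09, "bnIk": 4.3249e-09}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

    A cluster of n consonants between two nuclei always yields n + 1 candidates, which is why "bimplo" has four; the parser then selects the candidate with the highest probability under the trained grammar.</Paragraph>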
    <Paragraph position="2"> Besides the good performance of the current models in applications, further improvements of the present approach can possibly be achieved by embedding more prior phonotactic knowledge. For example, it might be useful to model the distribution of a consonant dependent on the previous one (Menzel, 2001). In future work, we will investigate this issue by using head-lexicalized probabilistic context-free grammars, like those suggested by Carroll and Rooth (1998), where the consonant cluster [StR] would be analyzed as:</Paragraph>
    <Paragraph position="4"> Here, the lexical choice events express the desired phonotactic feature. Moreover, it would be interesting to incorporate the stress feature.</Paragraph>
    <Paragraph position="5"> Our linguistic evaluation of the errors shows that most errors of the phonological parser occur in conjunction with morphological phenomena like prefixes, suffixes, and word boundaries. This suggests that a further morphological layer could improve word accuracy (see also Meng (2001)).</Paragraph>
  </Section>
</Paper>