<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0608"> <Title>Probabilistic Context-Free Grammars for Phonology</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> Morphological and Phonological Learning: Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia </SectionTitle> <Paragraph position="1"> In sum, we aim to show that our method (i) models all possible words of a language, (ii) models how likely certain structures are to be used (in comparison to pure dictionary-based approaches), (iii) yields good results in an application-oriented evaluation, (iv) is able to disambiguate competing structures, (v) can easily be applied to other languages, and (vi) produces mathematically well-defined models.</Paragraph> <Paragraph position="2"> The paper is organized as follows. We present our method in Section 2, the experiments in Section 3, and our evaluation in Section 4. In Section 5, we discuss the results, and in Section 6, we conclude.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Method </SectionTitle> <Paragraph position="0"> We build on the novel approach of Müller (2001a), which aims to combine the advantages of treebank and bracketed-corpus training. In general, this approach consists of four steps: (i) writing a (symbolic, i.e.
non-probabilistic) context-free phonological grammar with syllable boundaries, (ii) training this grammar on a large, automatically transcribed and syllabified corpus, (iii) transforming the resulting probabilistic phonological grammar by dropping the syllable boundaries, and (iv) predicting the syllable boundaries of unseen phoneme strings by choosing their most probable phonological tree according to the transformed probabilistic grammar.</Paragraph> <Paragraph position="1"> The advantages of this approach are that simple and efficient supervised training on bracketed corpora can be used (the brackets guarantee that every syllabified word of the training corpus receives exactly one analysis), and that raw phoneme strings can be parsed and syllabified after the grammar transformation.</Paragraph> <Paragraph position="2"> While preserving these advantages, our approach differs in several important details. First, we write a more advanced phonological grammar for German, yielding a more fine-grained probabilistic model of syllable structure. Second, our phonological grammar can easily be adapted to other languages (here, English) by adding grammar rules for missing phonemes. Third, in addition to an evaluation on a real-world task (syllabification for German), we qualitatively evaluate the resulting probabilistic versions of our phonological grammar for German and English.</Paragraph> <Paragraph position="4"> [Figure 1: The structure of the German word &quot;Abfall&quot; (waste) according to our phonological grammar for German.]</Paragraph> <Paragraph position="5"> Our phonological grammar divides a word into syllables, which are in turn rewritten by onset, nucleus, and coda. Furthermore, the phonological grammar differentiates between monosyllabic and polysyllabic words. In polysyllabic words, the syllables are divided into syllables appearing word-initially, word-medially, and word-finally. 
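As a minimal illustration of steps (ii)-(iv) above, the following Python sketch (not the authors' implementation; the toy vowel set, the example words, and the ad-hoc smoothing constant are assumptions for illustration, and it ignores the positional features discussed below) estimates onset/nucleus/coda rule probabilities from a small syllabified corpus and then syllabifies a raw phoneme string by choosing its most probable analysis:

```python
# Sketch of the four-step idea: train rule probabilities on a syllabified
# corpus, drop the boundaries, and syllabify unseen phoneme strings by
# picking the most probable analysis. Toy data, not SAMPA.
from collections import defaultdict
from functools import lru_cache

VOWELS = set("aeiou")  # toy phoneme inventory (assumption)

def split_syllable(syl):
    """Split a syllable string into (onset, nucleus, coda)."""
    i = 0
    while i < len(syl) and syl[i] not in VOWELS:
        i += 1
    j = i
    while j < len(syl) and syl[j] in VOWELS:
        j += 1
    return syl[:i], syl[i:j], syl[j:]

def train(corpus):
    """Relative-frequency estimates for On/Nuc/Cod rewriting rules.
    corpus: syllabified words with '.' as the syllable boundary."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus:
        for syl in word.split("."):
            on, nuc, cod = split_syllable(syl)
            for cat, s in (("On", on), ("Nuc", nuc), ("Cod", cod)):
                counts[cat][s] += 1
    return {cat: {s: n / sum(d.values()) for s, n in d.items()}
            for cat, d in counts.items()}

def syllabify(word, probs):
    """Most probable syllabification of a raw phoneme string
    (boundaries dropped, as in steps (iii) and (iv))."""
    def syl_prob(syl):
        on, nuc, cod = split_syllable(syl)
        if not nuc:            # every syllable needs a nucleus
            return 0.0
        p = 1.0
        for cat, s in (("On", on), ("Nuc", nuc), ("Cod", cod)):
            p *= probs[cat].get(s, 1e-9)  # crude stand-in for smoothing
        return p

    @lru_cache(maxsize=None)
    def best(s):
        if not s:
            return (1.0, ())
        return max((syl_prob(s[:k]) * best(s[k:])[0],
                    (s[:k],) + best(s[k:])[1])
                   for k in range(1, len(s) + 1))

    return ".".join(best(word)[1])

probs = train(["ab.fal", "man.tel", "ta.fel", "al.so"])
print(syllabify("abtel", probs))  # -> ab.tel
```

The dynamic program considers every segmentation of the input string, so unlike a dictionary lookup it assigns an analysis to any word, seen or unseen.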
Additionally, the grammar distinguishes between consonant clusters of different sizes (ranging from one to five consonants), as well as between consonants occurring in different positions within a cluster. Figure 1 displays the structure of the German word &quot;Abfall&quot; (waste) according to our phonological grammar.</Paragraph> <Paragraph position="6"> In the following sections, we focus in particular on the rewriting rules involving phonemic terminal nodes: X_{pos,i,n} -> c and Y_{pos} -> v. The rules of the first type bear three of the above-mentioned features for a consonant c inside an onset or a coda (X = On, Cod), namely: the position of the syllable in the word (pos = ini, med, fin, one), the cluster size (n ∈ {1,...,5}), and the position of the consonant within the cluster (i ∈ {1,...,n}). Vowels or diphthongs v of a nucleus (Y = Nucleus) obviously do not need the size and position features (n and i). The probabilities of these phonological rules (after supervised training) form the basis for our description and evaluation of the syllable parts in Section 4.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> In the following, we describe two experiments. The first investigates syllable structure for German; in the second, we generalize the method and apply it to another language (English).</Paragraph> <Paragraph position="1"> Experiment with German data.</Paragraph> <Paragraph position="2"> First, we manually write a phonological grammar for German consisting of 2,394 context-free rules.</Paragraph> <Paragraph position="3"> Compared to the most successful grammar constructed by Müller (2001a), our grammar is enriched with an additional feature: the size of the onsets and codas. 
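To make the rule schema X_{pos,i,n} -> c and Y_{pos} -> v concrete, the following Python sketch (an illustrative toy, not the authors' grammar; the vowel set and the example words are invented) expands a syllabified word into its phonemic terminal-rule applications, annotated with the pos, i, and n features:

```python
# Expand a '.'-syllabified word into terminal-rule applications of the
# schema X_{pos,i,n} -> c (X = On, Cod) and Y_{pos} -> v (Y = Nuc).
VOWELS = set("aeiou")  # toy inventory (assumption, not SAMPA)

def split_syllable(syl):
    """Split a syllable string into (onset, nucleus, coda)."""
    i = 0
    while i < len(syl) and syl[i] not in VOWELS:
        i += 1
    j = i
    while j < len(syl) and syl[j] in VOWELS:
        j += 1
    return syl[:i], syl[i:j], syl[j:]

def terminal_rules(word):
    """Return tuples ('On'/'Cod', pos, i, n, c) and ('Nuc', pos, v),
    where pos is ini/med/fin for polysyllabic words and 'one' for
    monosyllabic words, n is the cluster size, and i the position of
    the consonant within the cluster."""
    syls = word.split(".")
    rules = []
    for k, syl in enumerate(syls):
        if len(syls) == 1:
            pos = "one"
        elif k == 0:
            pos = "ini"
        elif k == len(syls) - 1:
            pos = "fin"
        else:
            pos = "med"
        on, nuc, cod = split_syllable(syl)
        for cat, cluster in (("On", on), ("Cod", cod)):
            n = len(cluster)
            for i, c in enumerate(cluster, 1):
                rules.append((cat, pos, i, n, c))
        rules.append(("Nuc", pos, nuc))  # no i/n features for nuclei
    return rules

for rule in terminal_rules("ab.fal"):
    print(rule)
```

Counting such tuples over a syllabified corpus and normalizing per left-hand side yields exactly the kind of rule probabilities that the qualitative evaluation in Section 4 inspects.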
Second, we extract a training corpus of 2,127,798 words (3,961,982 syllables) from a German newspaper, the Stuttgarter Zeitung, and an additional corpus of 242,047 words for testing. All words are looked up in the German part of CELEX (Baayen et al., 1993), yielding transcribed and syllabified corpora. As the phoneme set, we use the symbols of the English and German SAMPA alphabets (Wells, 1997). In contrast to Müller (2001a), we do not investigate smaller training corpora, since we are interested in maximal phonological knowledge about word-internal structure. Third, we train the phonological context-free grammar on the training corpus using the supervised method presented in Section 2. Additionally, to account for events that do not occur in the training data, we use the smoothing procedure implemented in the LoPar system (Schmid, 2000), which produces rules with positive probabilities.</Paragraph> <Paragraph position="4"> Experiment with English data.</Paragraph> <Paragraph position="5"> In this experiment, we show that our method can easily be applied to other languages. We create a second training corpus of comparable size (2,123,081 words) from the British National Corpus. The words are looked up in the English part of CELEX.</Paragraph> <Paragraph position="6"> Furthermore, the context-free grammar is extended by rules for all possible English phonemes. That is, we add preterminal rules for phonemes that do not occur in German words, e.g., rules for the apicodental phoneme /T/ (appearing in the word this). This (semi-automatic) procedure yields an English phonological grammar consisting of 4,418 rules, which is trained on the new corpus.</Paragraph> </Section> </Paper>