File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/p01-1053_intro.xml

Size: 3,197 bytes

Last Modified: 2025-10-06 14:01:12

<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1053">
  <Title>Automatic Detection of Syllable Boundaries Combining the Advantages of Treebank and Bracketed Corpora Training</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper we present an approach to supervised learning and automatic detection of syllable boundaries. The primary goal of the paper is to demonstrate that, under certain conditions, treebank and bracketed corpora training can be combined by exploiting the advantages of the two methods. Treebank training yields unambiguous analyses, whereas bracketed corpora training has the advantage that linguistic knowledge can be used to write linguistically motivated grammars.</Paragraph>
    <Paragraph position="1"> In text-to-speech (TTS) systems, like those described in Sproat (1998), the correct pronunciation of unknown or novel words is one of the biggest problems. Many TTS systems rely on large pronunciation dictionaries. However, lexicons are finite, and every natural language has productive word-formation processes. German, for example, is known for its extensive use of compounds. A TTS system therefore needs a module in which words that have been converted from graphemes to phonemes are syllabified before they are further processed into speech. The correct placement of syllable boundaries is essential for the application of phonological rules (Kahn, 1976; Blevins, 1995). Our approach offers a machine learning algorithm for predicting syllable boundaries.</Paragraph>
    <Paragraph position="2"> Our method builds on two resources. The first resource is a series of context-free grammars (CFG) which are either constructed manually or extracted automatically (in the case of the treebank grammar) to predict syllable boundaries.</Paragraph>
    <Paragraph position="3"> The different grammars are described in section 4. The second resource is a novel algorithm that aims to combine the advantages of treebank and bracketed corpora training. The obtained probabilistic context-free grammars are evaluated on a test corpus. We also investigate the influence of the size of the training corpus on the performance of our system.</Paragraph>
    <Paragraph position="4"> The evaluation shows that adding linguistic information to the grammars increases the accuracy of our models. For instance, we encoded the knowledge that (i) consonants in the onset and coda are restricted in their distribution, and (ii) the position within the word plays an important role. Furthermore, linguistically motivated grammars need only a small training corpus to achieve high accuracy, and they even outperform the treebank grammar trained on the largest training corpus.</Paragraph>
    <Paragraph position="5"> The remainder of the paper is organized as follows. Section 2 refers to treebank training. In section 3 we introduce the combination of treebank and bracketed corpora training. In section 4 we describe the grammars and experiments for German data. Section 5 is dedicated to evaluation and in section 6 we discuss our results.</Paragraph>
  </Section>
</Paper>