<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1053">
  <Title>Automatic Detection of Syllable Boundaries Combining the Advantages of Treebank and Bracketed Corpora Training</Title>
  <Section position="6" start_page="1" end_page="2" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We split our corpus into a 9/10 training and a 1/10 test portion, resulting in an evaluation (test) corpus of 242,047 words.</Paragraph>
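The paper does not give its splitting procedure in detail; the following is a minimal sketch of a 9/10 / 1/10 corpus split of the kind described above. The function name `split_corpus` and the fixed seed are illustrative assumptions, not from the paper.

```python
import random

def split_corpus(words, test_fraction=0.1, seed=0):
    # Shuffle a copy of the word list and split off the
    # requested fraction as test data; the rest is training data.
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(words)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

words = [f"word{i}" for i in range(100)]
train, test = split_corpus(words)
print(len(train), len(test))  # 90 10
```

With a 1/10 test fraction, a corpus of roughly 2.4 million word tokens yields a test set on the order of the 242,047 words reported.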
    <Paragraph position="1"> Our test corpus is available on the World Wide Web. Two features characterize our test corpus: (i) the number of unknown words in the test corpus, and (ii) the number of words with a given number of syllables. The proportion of unknown words is depicted in Figure 4: it is almost 100% for the smallest training corpus and decreases to about 5% for the largest one. This &quot;slow&quot; decrease in the number of unknown words is due both to the large amount of test data (242,047 items) and to the gradually growing size of the training corpus: as the training corpus grows, fewer words in the test corpus remain unseen. Figure 4 also shows the distribution of words in the test corpus ranked by the number of syllables, which is a decreasing function: almost 50% of the test corpus consists of monosyllabic words, and the number of words decreases as the number of syllables increases.</Paragraph>
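The unknown-word proportion discussed above can be sketched as follows; this is not from the paper, and the toy word lists are invented for illustration.

```python
def unknown_word_rate(train_words, test_words):
    # Fraction of test tokens whose word form never
    # occurs anywhere in the training data.
    seen = set(train_words)
    unknown = sum(1 for w in test_words if w not in seen)
    return unknown / len(test_words)

# Toy example: two of the four test words are unseen in training.
train = ["haus", "baum", "hund"]
test = ["haus", "katze", "baum", "vogel"]
rate = unknown_word_rate(train, test)
print(rate)  # 0.5
```

Computed over growing training corpora against a fixed test set, this rate traces the curve described above: near 100% for a tiny training corpus, falling toward 5% for the largest one.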
    <Paragraph position="2"> The test corpus, stripped of syllable boundaries, is processed by a parser (Schmid, 2000) with the probabilistic context-free grammars, retaining the most probable parse (Viterbi parse) of each word. We compare the results of the parsing step with our test corpus (annotated with syllable boundaries) and compute the accuracy. A word counts as correct only if the parser predicts all of its syllable boundaries; we therefore measure word accuracy.</Paragraph>
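The word-accuracy measure described above can be written as a short sketch. This is not the paper's code; the use of "." as a syllable-boundary marker and the toy German words are illustrative assumptions.

```python
def word_accuracy(predicted, gold):
    # A word counts as correct only if its predicted syllabification
    # matches the gold annotation exactly, i.e. every boundary is right.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Toy example with "." as a hypothetical boundary marker:
gold = ["ba.na.ne", "haus", "gar.ten"]
pred = ["ba.na.ne", "haus", "ga.rten"]  # last word has a misplaced boundary
acc = word_accuracy(pred, gold)  # 2 of 3 words fully correct
```

A single misplaced boundary makes the whole word count as wrong, which is why word accuracy is a stricter measure than per-boundary accuracy.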
    <Paragraph position="3"> The accuracy curves of all grammars are shown in Figure 6. Comparing the treebank grammar with the simplest linguistic grammar, we see that the accuracy of the treebank grammar increases monotonically, whereas the phoneme grammar stays almost constant at about 63%. The figure also shows that the simplest grammar outperforms the treebank grammar until the treebank grammar is trained on a corpus of 77,800 items; at that point both grammars reach an accuracy of about 65%. When the corpus size exceeds 77,800, the treebank grammar performs better than the simplest linguistic grammar. The best treebank grammar reaches an accuracy of 94.89%. The low accuracy of the treebank grammar trained on small corpora is due to the high number of syllables that were not seen during training. Figure 6 also shows that the CV grammar, the syllable structure grammar, and the positional syllable structure grammar outperform the treebank grammar by at least 6% with the second largest training corpus of about 1 million words. Even when the corpus size is doubled, the accuracy of the treebank grammar remains 1.5% below that of the positional syllable structure grammar.</Paragraph>
    <Paragraph position="4"> Moreover, the positional syllable structure grammar needs a corpus of only 9,600 items to outperform the treebank grammar. Figure 5 summarizes the best results of the different grammars for different corpus sizes.</Paragraph>
  </Section>
</Paper>