<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0805">
  <Title>The value of minimal prosodic information in spoken language</Title>
  <Section position="4" start_page="40" end_page="41" type="metho">
    <SectionTitle>
2 MARSEC : Machine Readable
Spoken English Corpus
</SectionTitle>
    <Paragraph position="0"> Since trained speakers will be producing the subtitles a corpus of speech from professional broadcasters is appropriate for this initial investigation. The MARSEC corpus has been mainly collected from the BBC, and is available free on the web (see references). It is marked with prosodic annotations but not with POS tags. \Ve have used part of the corpus, just over 26,000 words, comprising the 4 categories of news commentary (A), news broadcasts (B), lectures aimed at a general audience (C) and lectures aimed at a restricted audience (D).</Paragraph>
    <Paragraph position="1"> The prosodic markup has been done manually, by two annotators. Some passages have been done by both, and we see that there is a general consensus, but some differing decisions.</Paragraph>
    <Paragraph position="2"> Interpreting the speech data has an element of subjectivit3: In Table 1 we show some sample data as we used it, in which only the major and minor tone unit boundaries are retained. \Vhen passages were marked up twice, we chose one in an arbitrary way, so that each annotator was chosen about equally.</Paragraph>
    <Section position="1" start_page="40" end_page="41" type="sub_section">
      <SectionTitle>
2.1 Comparison of automated and manual markup of pauses
</SectionTitle>
      <Paragraph position="0"> manual markup of pauses We suggest that the production of this type of data may be technically feasible for a trained speaker using an ASR device, and is worth investigating further.</Paragraph>
      <Paragraph position="1"> (Huckvale and Fang, 1996) describe their method of automatically capturing pause information for the PROSICE corpus. The detection of major pauses is technically straightforward: they find regions of the signal that fall below a certain energy threshold (60Db) for at least 250ms. Minor pauses are more difficult to find, since they can occur within words, and their detection is integrated into the signal / word alignment process.</Paragraph>
      <Paragraph position="2"> \\re find that in that in the manually annotated MARSEC corpus, the ratio of words to major pauses is approximately 17.8, to minor pauses 5.4, or 4.1 if both type of pause are taken together. (Huckvale and Fang, 1996) quote figures that work out at 7.7 for major pauses, 30.8 for minor ones, or 6.15 taken together. This suggests that there is some discrepancy between  Key: \[I is major pause \] is minor pause annotator 1 annotator 2 we we  what is considered a major or minor pause. However, taking both together the resuks from the automated system is not out of line with the manual one on this statistical measure.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="41" end_page="43" type="metho">
    <SectionTitle>
3 Entropy indicators
</SectionTitle>
    <Paragraph position="0"> The entropy is a measure, in a certain sense, of the degree of unpredictability. If one representation captures more of the structure of language than another, then the entropy measures should decline.</Paragraph>
    <Paragraph position="1"> If H represents the entropy of a sequence and</Paragraph>
    <Paragraph position="3"> ber of choices.</Paragraph>
    <Section position="1" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
3.1 Definition of entropy
</SectionTitle>
      <Paragraph position="0"> Let ..4 be an alphabet, and X be a discrete random variable. The probability mass function is then p(x), such that</Paragraph>
      <Paragraph position="2"> For an initial investigation into the entropy of letter sequences the x's would be the 26 letters of the standard alphabet.</Paragraph>
      <Paragraph position="3"> The entropy H(X) is defined as</Paragraph>
      <Paragraph position="5"> If logs to base 2 are used, the entropy measures tile minimum number of bits needed on average to represent X: the wider the choice the more bits will be needed to describe it.</Paragraph>
      <Paragraph position="6"> \Ve talk loosely of the entropy of a sequence, but more precisely consider a sequence of symbols Xi which are outputs of a stochastic process. We estimate the entropy of the distribution of which the observed outcome is typical. Further references are (Bell et al., 1990; Cover and Thomas, 1991), or, for an introduction, (Charniak, 1993, Chapter 2).</Paragraph>
      <Paragraph position="7">  Though we are investigating sequences of words, the subject is introduced by recalling Shannon's well known work on the entropy of letter sequences (Shannon, 1951). He demonstrated that the entropy will decline if a representation is found that captures (i) the context and (ii) the structure of the sequence.</Paragraph>
      <Paragraph position="8"> Shannon produced a series of approximations to the entropy H of written English, which suecessively take more of the statistics of the lan- null guage into account. H0 represents the average number of bits required to determine a letter with no statistical information. Thus, for an alphabet of 16 symbols H0 = 4.0.</Paragraph>
      <Paragraph position="9"> H1 is calculated with information on single letter probabilities. If we knew, for example, that letter e had probability of 20~ of occurring while z had 1% we could code the alphabet with, on average, fewer bits than we could without this information. Thus H1 would be lower than H0.</Paragraph>
      <Paragraph position="10"> H2 uses information on the probability of 2 letters occurring together; Hn, called the n-gram entropy, measures the amount of entropy with information extending over n adjacent letters of text, 1 and Hn _&lt; Hn-1. As n increases from 0 to 3, the n-gram entropy declines: the degree of predictability is increased as information from more adjacent letters is taken into account. This fact is exploited in games where the contestants have to guess letters in words, such as the &amp;quot;Shannon game&amp;quot; or &amp;quot;Hangman&amp;quot; (Jelinek, 1990).</Paragraph>
      <Paragraph position="11"> The formula for calculating the entropy of a sequence is given in (Lyon and Brown, 1997).</Paragraph>
      <Paragraph position="12"> An account of the process is also given in (Cover and Thomas, 1991, chapter2) and (Shannon, 1951).</Paragraph>
    </Section>
    <Section position="2" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
3.2 Entropy and structure
</SectionTitle>
      <Paragraph position="0"> The entropy can also be reduced if some of the structure of the letter strings is captured. As Shammn says &amp;quot;a word is a cohesive group of letters with strong internal statistical influences&amp;quot; so the introduction of the space character to separate words should lower the entropy H2 and Ha. With an extra symbol in the alphabet H0 will rise. There will be more potential pairs and triples, so H2 and H3 could rise. However, as the space symbol will prevent &amp;quot;irregular&amp;quot; letter sequences between words, and thus reduce the unpredictability H~ and Ha do in fact decline.</Paragraph>
      <Paragraph position="1"> For instance, for the words  It differs from that used by (Bell et al., 1990).</Paragraph>
    </Section>
    <Section position="3" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
3.3 The entropy of ASCII data
</SectionTitle>
      <Paragraph position="0"> For other representations too, the insertion of boundary markers that capture the structure of a sequence will reduce the entropy. Gull and Skilling (1987) report on an experiment with a string of 32,768 zeroes and ones that are known to be ASCII data organised in patterns of 8 as bytes, but with the byte boundary marker missing. By comparing the entropy of the sequence with the marker in different positions the boundary of the data is &amp;quot;determined to a quite astronomical significance level&amp;quot;.</Paragraph>
    </Section>
    <Section position="4" start_page="42" end_page="43" type="sub_section">
      <SectionTitle>
3.4 The entropy of word sequences
</SectionTitle>
      <Paragraph position="0"> This method of analysis can also be applied to strings of words. The entropy indicator will show if a sequence of words can be decomposed into segments, so that some of the structure is captured. Our current work investigates whether pauses in spoken English perform this role.</Paragraph>
      <Paragraph position="1"> Previously we showed how the entropy of text mapped onto part-of-speech tags could be reduced if clauses and phrases were explicitly marked (Lyon and Brown, 1997). Syntactic markers can be considered analogous to spaces between words in letter sequence analysis. They are virtual punctuation marks.</Paragraph>
      <Paragraph position="2"> Consider, for example, how subordinate clauses are discerned. There may be an explicit opening marker, such as a 'wh' word, but often there is no mark to show the end of the clause. If markers are inserted and treated as virtual ptmctuation some of the structure is captured and the entropy declines. A sentence without opelfing or closing clause boundary markers, like The shirt he wants is in the wash.</Paragraph>
      <Paragraph position="3"> can be represented as The shirt { he wants } is in the wash.</Paragraph>
      <Paragraph position="4"> This sentence can be given part-of-speech tags, with two of the classes in the tagset representing the symbols '{' (virtual-tagl) and '}' (virtual-tag2). The ordinary part-of-speech tags have probabilistic relationships with the virtual tags in the same way that they do with each other. The pairs and triples generated by the second string exclude (noun, pronoun), (noun, pronoun, verb) but include, for instance, (noun, virtual-tag1), (noun, virtual-tag1, pronoun) null  Using this representation, the entropy, H2 and H3, with virtual tags explicitly raarking some constituents is lower than that without the virtual tags. In a similar way the words from a speech signal can be segmented into groups, with periodic pauses.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="43" end_page="44" type="metho">
    <SectionTitle>
4 Results from analysis of the
MARSEC corpus
</SectionTitle>
    <Paragraph position="0"> We can measure the entropy H0, Hi, .H2 and H3 for the corpus with and without prosodic markers for major and minor pauses. However, rather than use words themseh'es we map them onto part-of-speech tags. This reduces an indefinite number of words to a limited number of tags, and makes the investigation computationally feasible. We expect * H0 will be higher with a marker, since the alphabet size increases * H1, which takes into account the single element probabilities, will increase or decrease depending on the frequency of the new symbol.</Paragraph>
    <Paragraph position="1"> * H2 and H3 should fall if the symbols representing prosodic markers capture some of the language structure. We expect: H3 to show this more than H2.</Paragraph>
    <Paragraph position="2"> * If instead of the real pause markers mock ones are inserted in an arbitrary thshion: we expect H to rise in all cases.</Paragraph>
    <Paragraph position="3"> To conduct this investigation the MARSEC corpus was taken off the web; and pre-processed to leave the words plus major and minor tone unit boundaries, or pauses. Then it was automatically tagged, using a version of the Claws tagger 2. These tags were mapped onto a smaller tagset with 26 classes, 28 including the major and minor pauses. The tagset is given in the appendix. Random inspection indicated about 96% words correctly tagged.</Paragraph>
    <Paragraph position="4"> Then the entropy of part of the corpus was calculated (i) for words only (ii) with minor pauses represented (iii) with major pauses represented and (ix') with major and minor pauses represented. Results are shown in Table 2, and in Figure 1.</Paragraph>
    <Paragraph position="5"> ~Claws4, supplied by the University of Lancaster, described by Garside (1987) H3 is calculated in two different ways. First, the sequence of tags is taken as an uninterrupted string (column H3 (1) in Table 2). Secondly, we take the major pauses as equivalent to sentence ends, points of segmentation, and omit an)&amp;quot; triple that spans 2 sentences (column H3 (2)). In practice, this will be a sensible approach. null This experiment shows how the entropy H3 declines when information on pauses is explicitly represented. Though there is not a transparent mapping from prosody to structure, there is a relationship between them which can be exploited. These experiments indicate that English language used by professional speakers can be coded more efficiently when pauses are represented.</Paragraph>
    <Section position="1" start_page="43" end_page="43" type="sub_section">
      <SectionTitle>
4.1 Comparison with arbitrarily placed pauses
</SectionTitle>
      <Paragraph position="0"> pauses Compare these results to those of another experiment where the corpora of words only were taken and pauses inserted in an arbitrary manner. Major pauses were inserted every 19 words, minor pauses every 7 words, except where there is a clash with a major pause. The numbers of major and minor pauses are comparable to those in the real data. Results are shown in Table 3. H2 and H3 are higher than the comparable entropy levels for speech with pauses inserted as they were actually spoken. Moreover, the entropy levels are higher than for speech without any pauses: the arbitrary insertion has disrupted the underlying structure, and raised the unpredictability of the sequence.</Paragraph>
    </Section>
    <Section position="2" start_page="43" end_page="44" type="sub_section">
      <SectionTitle>
4.2 Entropy and corpus size
</SectionTitle>
      <Paragraph position="0"> Note that we are interested in comparative entropies. The entropy converges slowly to its asymptotic value as the size of the corpora increases, and this is an upper bound on entropy values for smaller corpora. Ignoring this may give misleading results (Farach and et al., 1995). The reason why entropy is underestimated for small corpora comes from the fact that we approximate probabilities by frequency counts, and for small corpora these may be poor approximations. The count of pairs and triples is the basis for the probability estimates, and with small corpora many of the triples in particular that will show later have not occurred.</Paragraph>
      <Paragraph position="1"> Thus the entropy and perplexity measures un- null pauses represented, 26 without them. The entropy is calculated with trigrams spanning a major pause omitted, as in Table 2 column H3 (2) derestimate their true values.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="44" end_page="46" type="metho">
    <SectionTitle>
5 The subtitling task
</SectionTitle>
    <Paragraph position="0"> We now show how the investigation described here is relevant to the subtitling task. Trained subtitlers are employed to output real time captions for some TV programmes, currently as a stream of type written text. In future this may be done with an ASR system. In either case, the production of the caption is followed by the task of displaying the text so that line breaks occur in natural positions. The type of programmes for which this is needed include contemporaneous commentary on on news events, sports, studio discussions, and chat shows.</Paragraph>
    <Paragraph position="1"> Caption format is limited by line length, and there are usually at most 2 lines per caption. Some examples of subtitles that have been displayed are taken from the broadcast commentary o11 Princess Diana's funeral, with line breaks as shown: As I said the great tenor bell is half muffled with a piece of leather around its clapper They now bear the coffin of the Princess of Wales into Westminster Abbey.</Paragraph>
    <Paragraph position="2"> An example ~oma chat show is: Who told you that you resemble Mr Hague?  positions : a major pause every 19 words, minor pause every 7 words (except for clashes with major) I work at a golf club and we have lots of societies and groups come in.</Paragraph>
    <Paragraph position="3"> The quality of the subtitles can be improved by placing the line breaks and caption breaks in places where the text would be naturally segmented. Though this is partially a subjective process, a style book can be produced that gives agreed guidelines.</Paragraph>
    <Paragraph position="4"> Some of the poor line breaks can be readily corrected, but the production of a high quality display overall is not a trivial task. The pauses in speech do not map straight onto suitable line breaks, but they are a significant source of information. In this work we have been considering the output of trained speakers, or the recording of rehearsed speech. This differs from ordinary, spontaneous speech, where hesitation phenomena may have a number of causes. However, in the type of speech we are processing we have shown that the use of pauses captures some syntactic structure. An example given by (Ostendorf and Vielleux, 1994) is Mary was amazed Ann Dewey was angry.</Paragraph>
    <Paragraph position="5"> which was produced as Mary was amazed \[\[ Ann Dewey was angry.</Paragraph>
    <Paragraph position="6"> To illustrate a problem of text segmentation consider how conjunctions should be handled. Now conjunctions join like with like: verb with verb, noun with noun, clause With clause, and so on. If a conjunction joins two single words, such as ::black and blue&amp;quot; we do not want it to trigger a line break. However, it may be a reasonable break point if it joins two longer components. Consider the following example from the MARSEC corpus: it held its trajectory for one minute I flashes burst I from its wings I and rockets exploded \] safely behind us II The word &amp;quot;and&amp;quot; without the pause marked is part of a trigram :;noun conjunction noun&amp;quot; which would typically stick together. In fact, it actually joins two sentences, and would be a good point for a break. By including the pause marker we can identify this.</Paragraph>
    <Paragraph position="7"> The proposed system for finding line breaks will integrate rule based and data driven components. This approach is derived from earlier work in a related field in which a partial parser has been developed (Lyon and Frank, 1997). It will be based on a part-of-speech trigram model combined with lexical information. We will be able to develop a better language model if we explicitly include a representation for major and minor pauses.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML