<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1004">
  <Title>Issues in the Transcription of English Conversational Grunts</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
http://www.sanpo.t.u-tokyo.ac.jp/~nigel/
Abstract
</SectionTitle>
    <Paragraph position="0"> Conversational grunts, such as uh-huh, un-hn, mm, and oh are ubiquitous in spoken English, but no satisfactory scheme for transcribing these items exists. This paper describes previous approaches, presents some facts about the phonetics of grunts, proposes a transcription scheme, and evaluates its accuracy. 1</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="29" type="metho">
    <SectionTitle>
1 The Importance of
Conversational Grunts
</SectionTitle>
    <Paragraph position="0"> Conversational grunts, such as uh-huh, un-hn, mm, and oh are ubiquitous in spoken English.</Paragraph>
    <Paragraph position="1"> In our conversation data, these grunts occur an average of once every 5 seconds in American English conversation. In a sample of 79 conversations from a larger corpus, Switchboard, um was the 6th most frequent item (after I, and, the, you, and a), and the four items uh, uh-huh, um and um-hum accounted for 4% of the total. These sounds are not only frequent, they are important in language use.</Paragraph>
    <Paragraph position="2"> To mention just one example, people learning English as a second language are handicapped in informal interactions if they cannot produce and recognize these sounds.</Paragraph>
    <Paragraph position="3"> 1 I would like to thank Takeki Kamiyama for phonetic label cross-checking, all those who let me record their conversations, and the anonymous referees; and also the Japanese Ministry of Education, the Sound Technology Promotion Foundation, the Nakayama Foundation, the Inamori Foundation, the International Communications Foundation and the Okawa Foundation for support.</Paragraph>
    <Paragraph position="4"> Just to be clear about definitions, in this paper 'grunts' 2 means sounds which are 'not words', where a prototypical "word" is a sound having 1. a clear meaning, 2. the ability to participate in syntactic constructions, and 3. a phonotactically normal pronunciation. For example, uh-huh is a grunt since it has no referential meaning, has no syntactic affinities, and has salient breathiness. In this paper 'conversational' refers to sounds which occur in conversation and are at least in part directed at the interlocutor, rather than being purely self-directed 3. Both of these definitions have flaws, but they provide a fairly objective criterion for delimiting the set of items which any transcription scheme should be able to handle.</Paragraph>
    <Paragraph position="5"> The phenomena circumscribed by this definition are a subset of "vocal segregates" (Trager, 1958) and of "interjections": the difference is that it limits attention to sounds occurring in conversations. This definition also roughly delimits the subset of "discourse markers" or "discourse particles" which occur in informal spoken discourse.</Paragraph>
    <Paragraph position="6"> As the phonetics and meanings of conversational grunts are currently not well understood, we have begun a project aiming to elucidate, model, and eventually exploit them.</Paragraph>
    <Paragraph position="7"> The current paper is a report on an approach to the preliminary problem of how to transcribe these sounds.</Paragraph>
    <Paragraph position="8"> 2 It may seem that the negative connotations of the word 'grunt' make it inappropriate for use as a technical term, but the phenomenon itself is often stigmatised, and so the term is appropriate in that sense too.</Paragraph>
    <Paragraph position="9"> 3 Two rules of thumb were adopted to help in cases which were difficult to judge: consider laughter as not conversational, and consider as conversational everything else that might possibly be playing some communicative role, even if it isn't clear what that role might be.</Paragraph>
    <Paragraph position="10"> A generally usable, standardized transcription scheme would be of great value. Immediate applications include screenplay writing and court recording. It would also facilitate the systematic corpus-based study of the meanings and functions of these sounds 4.</Paragraph>
    <Paragraph position="11"> There are also prospects for applications in systems. One could imagine a dialog transcription system that produces output with the grunts represented in enough detail to show whether a listener is being enthusiastic, reluctant, non-committal, bored, etc., as these states are often indicated by grunts rather than by words. One could imagine spoken dialog systems which prompt and confirm concisely with such grunts, instead of full words or phrases. And one could imagine spoken dialog systems which adjust their output based on barge-in feedback from the user such as uh-huh meaning "go on, don't talk so slow", uh-hum meaning "stop, I need to think", and ah meaning "I have something to say".</Paragraph>
    <Paragraph position="12"> Section 2 surveys previous approaches to grunt transcription, Section 3 proposes a slightly new scheme, Section 4 discusses its adequacy, and Section 5 points out some open issues.</Paragraph>
  </Section>
  <Section position="3" start_page="29" end_page="32" type="metho">
    <SectionTitle>
2 Previous Schemes for Grunt
Transcription
</SectionTitle>
    <Paragraph position="0"> This section points out the problems with previous approaches to grunt transcription.</Paragraph>
    <Section position="1" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
2.1 Phonetically Accurate Schemes
</SectionTitle>
      <Paragraph position="0"> One tradition in labeling grunts is to use a completely general scheme. The central inspiration here is the fact that grunts are unlike words, in that they contain sounds which are never seen in the lexical items of the language.</Paragraph>
      <Paragraph position="1"> As such, they can fall outside the coverage of even the International Phonetic Alphabet, which is only designed to handle those sounds which occur contrastively in some words in some language. Thus there have been proposals for richer, more complete transcription schemes, capable of handling just about any communicative noise that people have been observed to produce, including moans, cries and belches (Trager, 1958; Poyatos, 1975).</Paragraph>
      <Paragraph position="2"> 4 This is not to say that there can be a strict ordering of activities here: on the contrary, it is not possible to fix a transcription standard without at least a tacit theory of the meanings and functions of the items being transcribed. Some thoughts on this appear elsewhere (Ward, 2000).</Paragraph>
      <Paragraph position="3"> One disadvantage of these notations is that they are not usable without training.</Paragraph>
      <Paragraph position="4"> A second disadvantage is that their generality is excessive for everyday use. As seen below, the vast majority of conversational grunts are drawn from a much smaller inventory of sounds.</Paragraph>
      <Paragraph position="5"> A third disadvantage is that they provide more accuracy than is needed. For example, in English there appear to be no grunts in which the difference between an alveolar nasal, a velar nasal, or nasalization of a vowel conveys a difference in meaning, and so these do not need to be distinguished in transcription.</Paragraph>
    </Section>
    <Section position="2" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
2.2 Function-based Schemes
</SectionTitle>
      <Paragraph position="0"> An alternative approach is seen in some schemes used for labeling corpora for purposes of training and evaluating speech recognizers. A quote from the most recent Switchboard labeling standard (Hamaker et al., 1998) gives the flavor:
20. Hesitation Sounds: Use "uh" or "ah" for hesitations consisting of a vowel sound, and "um" or "hm" for hesitations with a nasal sound, depending upon which transcription the actual sound is closest to. Use "huh" for aspirated version of the hesitation as in "huh? &lt;other speaker responds&gt; um ok, I see your point."
21. yes/no sounds: Use "uh-huh" or "um-hum" (yes) and "huh-uh" or "hum-um" (no) for anything remotely resembling these sounds of assent or denial.
Another scheme (Lander, 1996) lists several "miscellaneous words", including: "nuh uh" (no), "mm hmm" (yes), "hmm mmm" (no), "hmm mm" (no), "uh huh" (yes), "huh uh" (no), "uh uh" (no). The inspiration behind these schemes seems to be the idea that grunts are just like words. This leads to two assumptions, both of which are questionable. First, there is the assumption that each grunt has some fixed meaning and some fixed functional role (filler, back-channel, etc). However, many specific grunt sounds can be found in more than one functional role, as seen in Table 1. Second, there is the assumption that the set of conversational grunts is small. However, the number of observed grunts is not small, as seen in Table 2, and the set of possible grunts is probably not even finite: for example, it would not be surprising at all to hear the sound hura-ha-har~ in conversation, or hem-ha-an, or hum-ha-un, and so on, and so on.
(However, not every possible sound seems likely to be a conversational grunt; for example ziflug would seem a surprising novelty, and would be downright weird in any of the functional positions typical for grunts.) One concrete problem with these schemes is that they are not designed to allow phonetically accurate representations of grunts 5. In particular, they make the task of the labeler a rather strange one. Given a grunt, first he must examine the context to determine whether it is a back-channel or a filler, then determine whether it sounds affirmative or negative, and only then can he consider what the actual sound is, and his options are limited to picking one of the labels in the functional/semantic category. The relation between the letters of the label and the phonetics of the grunt becomes somewhat arbitrary. This would be more tolerable if there were a clear tendency for each grunt to occur in only one functional position, but this is not the case, as noted above. The use of the affirmative/negative distinction as a primary classificatory feature is also open to question. In our corpus, only 1% of the grunts were negative in meaning, and these were all in contexts where a negative answer was expected or likely, so this distinction is a strange choice for a top-level dividing principle. Moreover, negative grunts are, in fact, characterized by two syllables with a sharp syllable boundary, often a glottal stop, and/or a sharp downstep in pitch, and/or a lack of breathiness, but these features are reflected only tenuously in the spellings listed as possible for negative grunts in these schemes.</Paragraph>
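      <Paragraph position="1"> The function-first labeling procedure criticized above can be made concrete with a small sketch. The table below paraphrases the two Switchboard rules quoted earlier, assuming the spellings um and hum-um for the nasal forms; the category names and both function names are illustrative assumptions, not part of any labeling standard.

```python
# Label inventory paraphrased from the quoted Switchboard rules 20 and 21.
# Keys are (functional role, subtype); the key names are hypothetical.
SWITCHBOARD_LABELS = {
    ("hesitation", "vowel"): ["uh", "ah"],
    ("hesitation", "nasal"): ["um", "hm"],
    ("hesitation", "aspirated"): ["huh"],
    ("response", "yes"): ["uh-huh", "um-hum"],
    ("response", "no"): ["huh-uh", "hum-um"],
}


def candidate_labels(function, subtype):
    """Spellings a labeler may pick once the function has been decided."""
    return SWITCHBOARD_LABELS[(function, subtype)]


def categories_for(label):
    """Every (function, subtype) pair under which a spelling is licensed."""
    return [key for key, labels in SWITCHBOARD_LABELS.items() if label in labels]


print(candidate_labels("response", "no"))
print(categories_for("uh-huh"))
```

Because each spelling is licensed under exactly one category in this table, a sound that is phonetically uh-huh but functions as a filler has no faithful label, which is the arbitrariness the paragraph above describes.</Paragraph>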
    </Section>
    <Section position="3" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
2.3 Naive Transcription
</SectionTitle>
      <Paragraph position="0"> The third tradition in transcribing grunts is to allow labelers to just spell them in the 'usual' way, as one might see them written in the comics or in a detective novel. The inspiration behind this is that native speakers generally have had a lot of exposure to orthographic representations of grunts, and can be trusted to do the right thing.</Paragraph>
      <Paragraph position="1"> One problem with this tradition is that the mapping from letter sequences to the actual sounds is not clear. For example, a conversation transcription given as a textbook example of good practice includes "u" and "uh", and "oh" and "oo" (Hutchby and Wooffitt, 1999), without footnoting. Presumably the "oo" means /u/, but it could also possibly mean a version of "oh" with strong lip rounding, or a longer form of "oh", or perhaps a shorter form (if the labeler was trying to avoid confusion with the archaic vocative "o"). English orthography is phonetically ambiguous and not standardized for grunts.</Paragraph>
      <Paragraph position="2"> A second problem with this tradition is that creaky voice (vocal fry), although pragmatically significant, is generally not represented (although many practitioners are surprisingly diligent at noting occurrences of breathiness).</Paragraph>
    </Section>
    <Section position="4" start_page="30" end_page="32" type="sub_section">
      <SectionTitle>
2.4 Summary of Desiderata
</SectionTitle>
      <Paragraph position="0"> Ideally we want a scheme for transcribing grunts which 1. is easy to learn and use, 2. can represent all observed grunts, and 3. unambiguously represents all meaningful differences in sound.</Paragraph>
      <Paragraph position="1"> 5 This is acceptable if the only aim is to train speech recognizers, where the speech recognizers' acoustic models will end up capturing the possible phonetic variation without human intervention, and if the speech recognition results are not intended for actual use, but merely to be fed into an algorithm for computing recognition scores.</Paragraph>
      <Paragraph position="3"> [Table not preserved in this extraction; surviving caption fragment: "... occurring 2 or more times in our corpus"]</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="6"> While it is not possible to devise a single transcription scheme which is perfect for all purposes (Barry and Fourcin, 1992), it is clear that the current schemes all have room for improvement.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="32" end_page="32" type="metho">
    <SectionTitle>
3 Proposal
</SectionTitle>
    <Paragraph position="0"> The basic idea is to start with the naive transcription tradition and then tighten it up.</Paragraph>
    <Paragraph position="1"> The advantages of using this as a starting point are two. First, it's convenient, since it is ASCII, familiar, and requires no special training. Second, as the cumulative result of many years of novelists' and cartoonists' efforts to represent dialog, it has presumably evolved to be fairly adequate for capturing those sound variations which are significant to meaning.</Paragraph>
    <Paragraph position="2"> The biggest need is to clarify and regularize the mapping from transcription to sound. This is the primary contribution of this paper: a specification of the actual phonetic values of each of the letters commonly used in transcribing conversational grunts, as follows: u means schwa. This causes no confusion because high vowels, including /u/, are vanishingly rare in conversational grunts.</Paragraph>
    <Paragraph position="3"> n generally means nasalization. This is unfamiliar in that English, unlike French, has no nasalized vowels in the words of the lexicon. However, in grunts nasalization is common, as in un-hn and nyeah, and meaning-bearing. Occasionally there may be nasal consonants, and n can also be used for such cases, without confusion, because they appear to bear the same semantic value.</Paragraph>
    <Paragraph position="4"> h generally means breathiness. This often occurs at syllable boundaries, as in uh-huh.</Paragraph>
    <Paragraph position="5"> Some items involve breathiness throughout a syllable, others involve a consonantal /h/, while others seem ambiguous between these two.</Paragraph>
    <Paragraph position="6"> A single syllable-final 'h' bears no phonetic value.</Paragraph>
    <Paragraph position="7"> tsk indicates an alveolar tongue click. These occur often in isolation, and occasionally grunt-initially 6.</Paragraph>
    <Paragraph position="8"> - (hyphen) indicates a fairly strong syllable boundary. Phonetically this means a major dip in energy level, a sharp discontinuity in pitch, or a significant region of breathy or creaky voice.</Paragraph>
    <Paragraph position="9"> [repetition] Repetition of a letter indicates length and/or multiple weakly-separated syllables.</Paragraph>
    <Paragraph position="10"> uu as a syllable is a special case, indicating a creaky schwa. All other letters have the normal values.</Paragraph>
    <Paragraph position="11"> There are two things that standard English orthography provides no way to express. These are expressed as annotations, following the basic transcription and separated from it by a colon.</Paragraph>
    <Paragraph position="12"> cr indicates creaky voice, as in yeah:cr. For further precision, numbers from 1 to 3 can be postposed, as in :cr1 for slightly creaky and :cr3 for extremely creaky.</Paragraph>
    <Paragraph position="13"> [numbers] Numbers after a colon indicate anchor points for the pitch contour, on the standard 1 to 5 scale. Thus uhuh:~-22 is a negative response or warning, but uh-huh:43-22 is a blatantly uninterested back-channel, and uh-huh:3234 is the standard, polite back-channel. Table 3 summarizes these letter-sound mappings. Table 4 suggests which sounds are most common.</Paragraph>
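    <Paragraph position="14"> The annotation syntax described above (a basic transcription followed by colon-separated annotations) is regular enough to be read mechanically. The following is a minimal sketch of such a reader, not part of the proposal itself. The class and function names are hypothetical, and it assumes that several annotations may be chained with further colons and that a hyphen inside a pitch annotation groups anchor points in the same way the hyphen groups syllables; the paper states neither.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Grunt:
    base: str                           # basic transcription, e.g. "uh-huh"
    syllables: List[str]                # parts split at strong (hyphen) boundaries
    creaky: bool = False                # True if a "cr" annotation is present
    creak_degree: Optional[int] = None  # 1 (slight) to 3 (extreme), if given
    pitch: List[List[int]] = field(default_factory=list)  # anchor points, 1 to 5


def parse_grunt(label):
    """Split a label such as "uh-huh:43-22" into transcription and annotations."""
    base, *annotations = label.split(":")
    grunt = Grunt(base=base, syllables=base.split("-"))
    for ann in annotations:
        if re.fullmatch(r"cr[1-3]?", ann):
            grunt.creaky = True
            grunt.creak_degree = int(ann[2:]) if ann[2:] else None
        elif re.fullmatch(r"[1-5]+(-[1-5]+)*", ann):
            # Assumption: hyphens in the pitch annotation mirror syllable groups.
            grunt.pitch = [[int(d) for d in part] for part in ann.split("-")]
        else:
            raise ValueError(f"unrecognized annotation: {ann!r}")
    return grunt


print(parse_grunt("uh-huh:43-22"))
print(parse_grunt("yeah:cr3"))
```

For uh-huh:43-22 the sketch yields two syllables with pitch anchor groups [4, 3] and [2, 2]. Labels whose annotation falls outside the stated ranges (creak degree 1 to 3, pitch 1 to 5) are rejected rather than guessed at.</Paragraph>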
  </Section>
  <Section position="5" start_page="32" end_page="34" type="metho">
    <SectionTitle>
4 Adequacy
</SectionTitle>
    <Paragraph position="0"> This scheme does fairly well by the criteria of §2.4.</Paragraph>
    <Paragraph position="1"> 6 There are cases where the click is followed by a voiced sound without any perceptible pause (with a delay from the onset of the click to the onset of voicing of 50 to 170 milliseconds).</Paragraph>
    <Paragraph position="2"> Table 3: letter-sound mappings (notation: phonetic value)
      Non-trivial mappings
        h: a single syllable-final 'h' bears no phonetic value; elsewhere 'h' indicates /h/ or breathiness
        n: nasalization, occasionally a nasal consonant (other than /m/)
        tsk: alveolar tongue click
        u: ə (schwa)
        repetition of a letter: length and/or multiple weakly-separated syllables
        - (hyphen): a fairly strong boundary between syllables or words
      Standard mappings common in grunts
        m: /m/
        o: /o/
        a: /a/
        y: /j/, as in yeah and variants
      Idiosyncratic spellings
        yeah: /je~/
        kay: /keI/, as in okay, ukay, llnkay, mkay etc.
        uu: as a syllable, indicates a short creaky or glottalized schwa
      [Table 4 not preserved in this extraction; surviving caption fragment: "... which include the various sound components"]
      1. As far as clarity and usability, this scheme has a direct and simple mapping from representation to the actual phonetics. It has been trivial to learn and easy to use (at least for the author; other labelers have not yet been trained).</Paragraph>
    <Paragraph position="3"> 2. As far as representational coverage, this scheme is adequate for some 97% (=306/317) of the grunts which occur in our corpus. Thus it is not truly complete, and labelers must be allowed to escape into standard lexical orthography (for things like oop-ep-oop and wow), into IPA (for cases like achh and yegh, palatal and velar fricatives, respectively), and into ad hoc notation (for cases like throat clearings and noisy exhalations).</Paragraph>
    <Paragraph position="4"> 3. As far as precision, the scheme allows sufficiently detailed representation, at least to a first approximation. In particular, it covers all known meaningful phonetic variations. It is, however, possible that other phonetic distinctions are also significant. For example, it may be that the exact height of a vowel matters, or the exact time point at which a vowel starts getting creaky, or the presence of glottal stops, lip rounding, glottalization, falsetto, and so on matter, or the precise details of pitch and energy contours matter. Conversely, the scheme is not over-precise: all the phonetic elements represented in the scheme appear to bear meanings (Ward, 2000).</Paragraph>
    <Paragraph position="5"> Regarding unambiguity, the scheme is an improvement but has one failing: repetition of a letter represents either extended duration or the presence of multiple syllables. As these two phonetic features are generally correlated, and the difference in meaning between them is anyway subtle, this may not be a major problem.</Paragraph>
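    <Paragraph position="6"> The one ambiguity noted above, repeated letters, is easy to locate mechanically. The sketch below lists the repeated-letter runs in a basic transcription so that a labeler could review them; it is only an illustration under stated assumptions. The function name is hypothetical, and it assumes lower-case ASCII labels and that a syllable spelled exactly uu should be skipped because the scheme gives it a fixed value.

```python
import re


def ambiguous_repetitions(base):
    """Return (letter, run_length) for each repeated-letter run in a label."""
    runs = []
    for syllable in base.split("-"):
        if syllable == "uu":
            continue  # special case in the scheme: a creaky schwa, not length
        for match in re.finditer(r"([a-z])\1+", syllable):
            runs.append((match.group(1), len(match.group(0))))
    return runs


print(ambiguous_repetitions("mmm"))
print(ambiguous_repetitions("uh-huh"))
```

A label such as mmm is reported as one run of three, which could stand for a long nasal or for several weakly-separated syllables, whereas uh-huh yields no runs at all.</Paragraph>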
  </Section>
</Paper>