<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1015"> <Title>Prosodic Aids to Syntactic and Semantic Analysis of Spoken English</Title> <Section position="4" start_page="112" end_page="112" type="metho"> <SectionTitle> 2. SYSTEM OVERVIEW </SectionTitle> <Paragraph position="0"> Our work aims to build a prototype system that understands spoken requests to an electronic directory assistance service, such as finding the phone number and address of a local business that offers particular services.</Paragraph> <Paragraph position="1"> Our immediate work does not concentrate on speech recognition (SR) or lexical access. Instead, we assume that a future speech recognition system performs phoneme recognition and uses linguistic information during word recognition. Recognition is supplemented by a prosodic feature extractor, which produces features synchronized to the word string output by the SR.</Paragraph> <Paragraph position="2"> The output of the recognizer is passed to a sentence-level parser. Here &quot;sentence&quot; really means a conversational move, that is, a contiguous utterance of words constructed so as to convey a proposition.</Paragraph> <Paragraph position="3"> Parses of conversational moves are passed to a dialogue analyzer that segments the dialogue into contextually-consistent sub-dialogues (i.e., exchanges) and interprets speaker requests in terms of available system functions. A dialogue manager manages interaction with the speaker and retrieves database information.</Paragraph> </Section> <Section position="5" start_page="112" end_page="113" type="metho"> <SectionTitle> 3. PROSODY EXTRACTION </SectionTitle> <Paragraph position="0"> As the input to the parser is spoken language, it lacks the segmentation apparent in text. Within a move, there is no punctuation to hint at internal grammatical structure. In addition, as complete sentences are frequently reduced to phrases, ellipsis etc. 
during a dialogue, the Parser cannot use syntax alone for segmentation.</Paragraph> <Paragraph position="1"> Although intonation reflects deeper issues, such as a speaker's intended interpretation, it provides the surface structure for spoken language. Intonation is inherently supra-segmental, but it is also useful for segmentation purposes where other information is unavailable. Thus, intonation can be used to provide initial segmentation via a pre-processor for the parser.</Paragraph> <Paragraph position="2"> Although there are many prosodic features that are potentially useful in the understanding of spoken English, pitch and pause information have received the most attention due to ease of measurement and their relative importance (Cruttenden 1986, pp. 3 &amp; 36). Our efforts to date use only these two feature types.</Paragraph> <Paragraph position="3"> We extract pitch and pause information from speech using specially designed hardware with some software post-processing. The hardware performs frequency-to-amplitude transformation and filtering to produce an approximate pitch contour with pauses.</Paragraph> <Paragraph position="4"> The post-processing samples the pitch contour, determines the pitch range and classifies the instantaneous pitch into high, medium and low categories within that range. This classification is similar to that used in (Hirschberg &amp; Pierrehumbert 1986).</Paragraph> <Paragraph position="5"> Pauses are classed as short (less than 250ms), long (between 250ms and 800ms) or extended (greater than 800ms). These times were empirically derived from spoken information-seeking dialogues conducted over a telephone to human operators. Short pauses signify strong turn-holding behaviour, long pauses signify weaker turn-holding behaviour and extended pauses signify turn passing or exchange completion (Vonwiller 1991). These interpretations can vary with certain pitch movements, however. 
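The pause classes just described can be illustrated with a minimal sketch. It is not the extractor itself, which combines dedicated hardware with software post-processing; it only applies the thresholds quoted above. The names Pause, classify_pause and DEFAULT_SIGNAL are hypothetical, and treating durations of exactly 250ms and 800ms as long pauses is an assumption, since the text says only "between".

```python
from enum import Enum


class Pause(Enum):
    """The three pause classes named in the text."""
    SHORT = "short"
    LONG = "long"
    EXTENDED = "extended"


# Thresholds in milliseconds, as quoted in the text.
SHORT_LIMIT_MS = 250
LONG_LIMIT_MS = 800

# Default conversational reading of each class (Vonwiller 1991).
# The text notes that certain pitch movements can change these readings;
# that interaction is not modelled here.
DEFAULT_SIGNAL = {
    Pause.SHORT: "strong turn-holding",
    Pause.LONG: "weaker turn-holding",
    Pause.EXTENDED: "turn passing or exchange completion",
}


def classify_pause(duration_ms: float) -> Pause:
    """Map a measured pause duration to short, long or extended."""
    if not duration_ms >= 0:
        # Also rejects NaN, for which every comparison is false.
        raise ValueError("pause duration must be non-negative")
    if duration_ms > LONG_LIMIT_MS:
        return Pause.EXTENDED
    if duration_ms >= SHORT_LIMIT_MS:
        # Assumption: both boundary values count as long pauses.
        return Pause.LONG
    return Pause.SHORT
```

For example, classify_pause(400) gives Pause.LONG, whose default reading in DEFAULT_SIGNAL is weaker turn-holding.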
Unvoiced sounds are distinguished from pauses when the prosodic features are subsequently synchronised with the word stream during post-processing.</Paragraph> <Paragraph position="6"> A parser pre-processor then takes the SR word string, pitch markers and pauses, annotating the word string with pitch markers (low marked as &quot;~&quot;, medium &quot;-&quot; and high &quot;^&quot;) and pauses (short &quot;*&quot; and long &quot;**&quot;). The markers are synchronised with words or syllables. The pre-processor uses the pitch and pause markers to segment the word string into intonationally consistent groups, such as tone groups (boundaries marked as &quot;&lt;&quot; and &quot;&gt;&quot;) and moves (//). A tone group is a group of words whose intonational structure indicates that they form a major structural component of the speech, which is commonly also a major syntactic grouping (Cruttenden 1986, pp. 75 - 80). Short conversational moves often correspond to tone groups, while longer moves may consist of several tone groups. With cue words, for example, the cue forms its own tone group.</Paragraph> <Paragraph position="7"> Pauses usually occur at points of low transitional probability and often mark phrase boundaries (Cruttenden 1986). In general, although pitch plays an important part, long pauses indicate tone group and move boundaries, and short pauses indicate tone group boundaries. Exchange boundary markers are dealt with in the dialogue manager (not covered here). Pitch movements indicate turn-holding behaviour, topic changes, move completion and information contrastiveness (Cooper &amp; Sorensen 1977; Vonwiller 1991).</Paragraph> <Paragraph position="8"> The pre-processor also locates fixed expressions, so that nondeterminism can be reduced during parsing. A problem here is that a cluster of words may be ambiguous in terms of whether they form a fixed expression or not. 
&quot;Look after&quot;, for example, means &quot;take care of&quot; in &quot;Mary helped John to look after his kids&quot;, whereas &quot;look&quot; and &quot;after&quot; have separate meanings in &quot;I'll look after you do so&quot;. The pre-processor makes use of tone group information to help resolve the fixed expression ambiguity. A more detailed discussion is given in section 5.2.</Paragraph> </Section> <Section position="6" start_page="113" end_page="113" type="metho"> <SectionTitle> 4. THE PARSER </SectionTitle> <Paragraph position="0"> Once the input is segmented, moves annotated with prosody are input to the parser. The parser deals with one move at a time.</Paragraph> <Paragraph position="1"> In general, the intonational structure of a sentence and its syntactic structure coincide (Cruttenden 1986). Thus, prosodic segmentation avoids having the Parser try to extract moves from unsegmented word strings based solely on syntax. It also reduces the computational complexity of comparing syntactic and prosodic word groupings. There is a complication, however, in that tone group boundaries and move boundaries may not align exactly. This is not frequent, and is not present in the material used here. Intonation is used to limit the range of syntactic possibilities and the parser will align tone group and move syntactic boundaries at a later stage.</Paragraph> <Paragraph position="2"> By integrating syntax and semantics, the Parser is capable of resolving most of the ambiguous structures it encounters in parsing written English sentences, such as coordinate conjunctions, PP attachments, and lexical ambiguity (Huang 1988). 
Migrating the Parser from written to spoken English is our current focus.</Paragraph> <Paragraph position="3"> Moves input to the Parser are unlikely to be well-formed sentences, either because people do not always speak grammatically or because the SR cannot always recognise the words actually spoken.</Paragraph> <Paragraph position="4"> The parser first assumes that the input move is lexically correct and tries to obtain a parse for it, employing syntactic and semantic relaxation techniques for handling ill-formed sentences (Huang 1988). If no acceptable analysis is produced, the parser asks the SR to provide the next alternative word string.</Paragraph> <Paragraph position="5"> Exchanges between the parser and the SR are needed for handling situations where an ill-formed utterance gets further distorted by the SR. In these cases other knowledge sources such as pragmatics, dialogue analysis, and dialogue management must be used to find the most likely interpretation for the input string. We use pragmatics and knowledge of dialogue structure to find the semantic links between separate conversational moves by either participant and resolve indirectness such as pronouns, deictic expressions and brief responses to the other speaker [for more details, see (Rowles 1989)].</Paragraph> <Paragraph position="6"> By determining the dialogue purpose of utterances and their domain context, it is then possible to correct some of the insertion and misrecognised-word errors from the SR and determine the communicative intent of the speaker. The dialogue manager queries the speaker if sentences cannot be analysed at the pragmatic stage.</Paragraph> <Paragraph position="7"> The output of the parser is a parse tree that contains syntactic, semantic and prosodic features. 
Most ambiguity is removed in the parse tree, though some is left for later resolution, such as definite and anaphoric references, whose resolution normally requires inter-move inferences.</Paragraph> <Paragraph position="8"> The parser also detects cue words in its input using prosody. Cue words, such as &quot;now&quot; in &quot;Now, I want to...&quot;, are words whose meta-function in determining the structure of dialogues overrides their semantic roles (Reichman 1985). Cue words and phrases are prosodically distinct due to their high pitch and pause separation from tone groups that convey most of the propositional content (Hirschberg &amp; Litman 1987). While relatively unimportant semantically, cue words are very important in dialogue analysis due to their ability to indicate segmentation and the linkage of the dialogue components.</Paragraph> </Section> <Section position="7" start_page="113" end_page="113" type="metho"> <SectionTitle> 5. PROSODY AND DISAMBIGUATION </SectionTitle> <Paragraph position="0"> During parsing, prosodic information is used to help disambiguate certain structures which cannot be disambiguated syntactically/semantically, or whose processing demands extra effort if no such prosodic information is available.</Paragraph> <Paragraph position="1"> In general, prosody includes pitch, loudness, duration (of words, morphemes and pauses) and rhythm. While all of these are important cues, we are currently focussing on pitch and pauses as these are easily extracted from the waveform and offer useful disambiguation during parsing and segmentation in dialogue analysis. 
Subsequent work will include the other features, and further refinement of the use of pitch and pause.</Paragraph> <Paragraph position="2"> At present, for example, we do not consider the length of pauses internal to tone groups, although this may be significant.</Paragraph> <Paragraph position="3"> The prosodic markers are used by the parser as additional pre-conditions for grammatical rules, discriminating between possible grammatical constructions via consistent intonational structures.</Paragraph> <Section position="1" start_page="113" end_page="113" type="sub_section"> <SectionTitle> 5.1 HOMOGRAPHS </SectionTitle> <Paragraph position="0"> Even when using prosody, homographs are a problem for parsers, although a system recognising words from phonemes can make the problem simpler. The word sense of &quot;bank&quot; in &quot;John went to the bank&quot; must be determined from semantics as the sense is not dependent upon vocalisation, but the difference between the homograph &quot;content&quot; in &quot;contents of a book&quot; and &quot;happy and content&quot; can be determined through differing syllabic stress and resultant different phonemes. Thus, different homographs can be detected during lexical access in the SR independently of the Parser.</Paragraph> </Section> <Section position="2" start_page="113" end_page="113" type="sub_section"> <SectionTitle> 5.2 FIXED EXPRESSIONS </SectionTitle> <Paragraph position="0"> As mentioned in section 3, when the pre-processor tries to locate fixed expressions, it may face multiple choices. Some fixed expressions are obligatory, i.e., they form single semantic units; for instance, &quot;look forward to&quot; often means &quot;expect to feel pleasure in (something about to happen)&quot; (see footnote 2).</Paragraph> </Section> </Section> <Section position="8" start_page="113" end_page="116" type="metho"> <SectionTitle> 2. Longman Dictionary of Contemporary English, 1978. </SectionTitle> <Paragraph position="1"> Some other strings may or may not form single semantic units, depending on the context. &quot;Look after&quot; and &quot;win over&quot; are two examples. Without prosodic information, the pre-processor has to make a choice blindly, e.g. treating all potential fixed expressions as such and, on backtracking, dissolving them into separate words. This adds to the nondeterminism of the parsing. Once prosodic information is available, this nondeterminism is avoided.</Paragraph> <Paragraph position="2"> In the system's fixed expression lexicon, we have entries such as &quot;fix_e([gave, up], gave_up)&quot;. The pre-processor contains a rule to the following effect, which conjoins two (or more) words into one fixed expression only when there is no pause following the first word: In (5.1a), &quot;gave&quot; and &quot;up to&quot; are treated as belonging to two separate tone groups, whereas in (5.1b) &quot;gave up&quot; is marked as one tone group. The pre-processor, checking its fixed expression dictionary, will therefore convert &quot;up to&quot; in (5.1a) to &quot;up_to&quot;, and &quot;gave up&quot; in (5.1b) to &quot;gave_up&quot;.</Paragraph> <Section position="1" start_page="113" end_page="115" type="sub_section"> <SectionTitle> 5.3 PP ATTACHMENT </SectionTitle> <Paragraph position="0"> Steedman (1990) and Cruttenden (1986) observed that intonational structure is strongly constrained by meaning. For example, an intonation imposing bracketings like the following is not allowed: (5.2) &lt;Three cats&gt; &lt;in ten prefer corduroy&gt;// Conversely, the actual contour detected for the input can be significant in helping decide the segmentation and resolving PP attachment. 
In the following sentences, for example, (5.3) &lt;I would like&gt; &lt;information on her arrival&gt; [&quot;on her arrival&quot; attached to &quot;information&quot;] (5.4) &lt;I would like&gt; &lt;information&gt; ** &lt;on her arrival&gt; [&quot;on her arrival&quot; attached to &quot;like&quot;] the pause after &quot;information&quot; in (5.4), absent in (5.3), breaks the bracketed phrase of (5.3) into two separate tone groups with different attachments. In a clash between prosodic constraints and syntactic/semantic constraints, the latter take precedence over the former. For instance, in: (5.5) &lt;I would like&gt; &lt;information&gt; ** &lt;on some panel beaters in my area&gt;.</Paragraph> <Paragraph position="1"> although the intonation does not suggest attachment of the PP to &quot;information&quot;, since the semantic constraints exclude attachment to &quot;like&quot; meaning &quot;choose to have&quot; (&quot;On panel beaters [as a location or time] I like information&quot; does not rate as a good interpretation), it is attached to &quot;information&quot; anyway (which satisfies the syntactic/semantic constraints).</Paragraph> </Section> <Section position="2" start_page="115" end_page="116" type="sub_section"> <SectionTitle> 5.4 COORDINATE CONSTRUCTIONS </SectionTitle> <Paragraph position="0"> Coordinate constructions can be highly ambiguous, and are handled by rules such as: np(FinalNP) --> det(Det), adj(Adj), /* check if a pause follows the adjective */ {check_pause(Flag)}, noun(Noun), {construct_np(Det, Adj, Noun, NP)}, conjunction(NP, Flag, FinalNP).</Paragraph> <Paragraph position="1"> In the conjunction rule, if two noun phrases are joined, we check for any pauses to see if the adjective modifying the first noun should be copied to allow it to modify the second noun. Similarly, we check for a pause preceding the conjunction to decide if we should copy the post modifier of the second noun to the first noun phrase. 
For instance, the text-form phrase: (5.6) old men and women in glasses can produce three possible interpretations: [old men (in glasses)] and [(old) women in glasses] (5.6a) [old men] and [women in glasses] (5.6b) [old men (in glasses)] and [women in glasses] (5.6c) distinguished in the software post-processor. In all waveforms &quot;old&quot; and &quot;glasses&quot; have high pitch. In (5.6a), a short pause follows &quot;old&quot;, indicating that &quot;old&quot; modifies &quot;men and women in glasses&quot; as a sub-phrase. This is in contrast to (5.6b), where the short pause appears after &quot;men&quot;, indicating &quot;old men&quot; as one conjunct and &quot;women in glasses&quot; as the other. Notice also that the duration of &quot;men&quot; in (5.6b) is longer than in (5.6a). In (5.6c) we have two major pauses, a shorter one after &quot;men&quot; and a longer one after &quot;women&quot;. Using this variation in pause locations, the parser produces the correct interpretation (i.e. the speaker's intended interpretation) for sentences (5.6a-c).</Paragraph> </Section> </Section> <Section position="9" start_page="116" end_page="117" type="metho"> <SectionTitle> 6. IMPLEMENTATION </SectionTitle> <Paragraph position="0"> Prosodic information, currently the pitch contour and pauses, is extracted by hardware and software. The hardware detects pitch and pauses from the speech waveform, while the software determines the duration of pauses, categorises pitch movements and synchronises these to the sequence of lexical tokens output from a hypothetical word recogniser. The parser is written in the Definite Clause Grammar formalism (Pereira et al. 1980) and runs under BIMProlog on a SPARCstation 1. 
The pitch and pause extractor as described here is also complete.</Paragraph> <Paragraph position="1"> To illustrate the function of the prosodic feature extractor and the Parser pre-processor, the following sentence was uttered and its pitch contour analysed: &quot;yes i'd like information on some panel beaters&quot; Prosodic feature extraction produced: ** ^yes ** ^i'd ^like * -information on some ^panel beaters **// The Parser pre-processor then segments the input (in terms of moves and tone groups) for the Parser, resulting in: **&lt; ^yes&gt; **//&lt; ^i'd ^like&gt; * &lt;-information on some ^panel beaters&gt; **// The actual output of the pre-processor is in two parts, one an indexed string of lexical items plus prosodic information, the other a string of tone groups indicating their start and end points: [** ^yes, 1] [**// ^i, 2] [would, 3] [^like, 4] [* -information, 5] [on, 6] [some, 7] [^panel_beaters, 8] [**//, 9] &lt;1, 1&gt; &lt;2, 4&gt; &lt;5, 8&gt; &lt;9, 9&gt; We use a set of sentences (adapted from Briscoe &amp; Boguraev 1984), all beginning with &quot;Before the King/feature races&quot; but with different intonation to provide different interpretations, to illustrate how syntax, semantics and prosody are used for disambiguation:</Paragraph> <Paragraph position="2"> (6.1) &lt;~Before the -King ^races&gt; *&lt;-his ^horse&gt; &lt;is -usually ^groomed&gt;**//.</Paragraph> <Paragraph position="3"> (6.2) &lt;~Before the -King&gt; *&lt;-races his ^horse&gt; **&lt;it's -usually ^groomed&gt;**//.</Paragraph> <Paragraph position="4"> (6.3) &lt;~Before the ^feature ~races&gt; *&lt;-his ^horse is -usually ^groomed&gt;**//.</Paragraph> <Paragraph position="5"> The syntactic ambiguity of &quot;before&quot; (preposition in 6.3 and subordinate conjunction in 6.1 and 6.2) is solved by semantic checking: &quot;race&quot; as a verb requires an animate subject, which &quot;the King&quot; satisfies, but not &quot;the feature&quot;; &quot;race&quot; as a noun can normally be modified by other nouns such as &quot;feature&quot;, but not &quot;King&quot; (see footnote 4). However, when prosodic information is not used, the time needed for parsing the three sentences varies tremendously, due to the top-down, depth-first nature of the parser. (6.3) took 2.05 seconds to parse, whereas (6.1) took 9.34 seconds, and (6.2), 41.78 seconds. The explanation is that, on seeing the word &quot;before&quot;, the parser assumed that it was a preposition (correct for 6.3) and took the &quot;wrong&quot; path before backtracking to find that it really was a conjunction (for 6.1 and 6.2). Changing the order of rules would not help here: if the first assumption treated &quot;before&quot; as a conjunction, then parsing of (6.3) would be slowed down.</Paragraph> <Paragraph position="6"> We made one change to the grammar so that it takes into account the pitch information accompanying the word &quot;races&quot;, to see if an improvement can be made. The grammar now states that a noun-noun string can form a compound noun group only when the last noun has a low pitch. That is, &quot;the feature ~races&quot; forms a legitimate noun phrase, while &quot;the King -races&quot; and &quot;the King ^races&quot; do not. 
This is in accordance with one of the best known English stress rules, the &quot;Compound Stress Rule&quot; (Chomsky and Halle 1968), which asserts that the first lexically stressed syllable in a constituent has the primary stress if the constituent is a compound construction forming an adjective, verb, or noun.</Paragraph> <Paragraph position="7"> 4. It is very difficult, though, to draw a clear line as to which kinds of nouns can function as noun modifiers. &quot;King races&quot; may be a perfectly good noun group in certain contexts.</Paragraph> <Paragraph position="8"> We then added the pause information in the parser along similar lines. The following is a simplified version of the VP grammar to illustrate the parsing mechanism: /* Noun phrase rule.</Paragraph> <Paragraph position="9"> &quot;Mods&quot; can be a string of adjectives or nouns: major (races), feature (races), etc. */ The pause information following &quot;races&quot; in sentences (6.1) and (6.2) thus helps the parser to decide if &quot;races&quot; is transitive or intransitive, again reducing nondeterminism. The above rules specify only the preferred patterns, not absolute constraints. If they cannot be satisfied, e.g. when there is no pause detected after a verb which is intransitive, the string is accepted anyway.</Paragraph> <Paragraph position="10"> The parse times for sentences (6.1) to (6.3) with and without prosodic rules in the parser are given in Table 6.1.</Paragraph> <Paragraph position="11"> While (6.6) is slower with prosodic annotation, the parser correctly recognises &quot;now&quot; as a cue word rather than as an adverb.</Paragraph> </Section> </Paper>