<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2133"> <Title>Prosody and the Resolution of Pronominal Anaphora</Title> <Section position="4" start_page="919" end_page="919" type="metho"> <SectionTitle> 3 The Corpus: TRAINS93 </SectionTitle> <Paragraph position="0"> Our data is taken from the TRAINS93 corpus of human-human problem solving dialogs in the logistics planning domain. In these dialogs, one participant plays the role of the planning assistant and the other attempts to construct a plan for delivering specified cargo to its destination. We used a subset of 18 TRAINS93 dialogs in which the referent and antecedent of third-person non-gendered pronouns (see footnote 1) had been annotated in a previous study (Byron and Allen, 1998). In the dialogs used for the present study, 322 pronouns (158 personal and 164 demonstrative) have been annotated. Personal pronouns in the dialogs are it, its, itself, them, they, their and themselves.</Paragraph> <Paragraph position="1"> Demonstrative pronouns in the annotation data are that, this, these, those. There are five male and 11 female speakers. One female speaker contributed 89 pronouns, two others produced more than 30 each (one female, one male), the rest is divided unevenly among the remaining 13 speakers. The set of dialogs chosen for annotation intentionally included a variety of speakers so that no speaker's idiosyncratic discourse strategies would be prevalent in the resulting data.</Paragraph> <Paragraph position="2"> Table 1 describes the attributes captured for each pronoun. These features were chosen for the annotation because many previous studies have shown them to be important for pronoun resolution. Features include attributes of the pronoun, its antecedent (the discourse constituent that previously triggered the referent), and its referent (the entity that should be substituted for the pronoun in a semantic representation of the sentence). 
Cb was annotated using Model3 from (Byron and Stent, 1998) with a linear model of discourse structure. Note that annotated pronouns were not limited to those with NP antecedents, as is the case with most other studies. In addition to NP antecedents, pronouns in this data set could have an antecedent of some other phrase or clause type, or no annotatable antecedent at all. There are two categories of pronouns with no annotatable antecedent. In the simplest case, the pronominal reference is the first mention of the referent in the dialog. That happens when the referent is inferred from the problem solving state. For example, after the utterance send the engine to Corning and pick up the boxcars, a new discourse entity, the train composed of the engine and boxcars, is available for anaphoric reference. Footnote 1: No gendered entities exist in this corpus, so gendered pronouns were not included. All demonstrative pronouns were annotated; however, there were only 5 occurrences of &quot;this&quot; in the selected dialogs, so contrasts between proximal and distal demonstratives could not be studied.</Paragraph> <Paragraph position="3"> def = the pronoun is one of {it, its, itself, them, they, their, themselves}; dem = the pronoun is one of {that, this, these, those}</Paragraph> <Paragraph position="5"/> </Section> <Section position="5" start_page="919" end_page="919" type="metho"> <SectionTitle> ANTE ANTESUBJ </SectionTitle> <Paragraph position="0"> Columns: ANTE (NP/pron., non-NP, none), ANTESUBJ (yes, no), same. personal: 75.9%, 6.3%, 17.8%; 37.3%, 62.7%; 29.1%. demonstrative: 28.0%, 36.0%, 36.0%; 14.0%, 86.0%; 18.9%. total: 51.6%, 21.4%, 27.0%; 25.5%, 74.5%; 23.9%. personal 33.5% 20.2%; demonstrative 29.9% 15.2%; total 31.7% 17.7%. Percentages are given relative to the total number of pronouns in that category and rounded. Boldface: most frequent antecedent property.</Paragraph> <Paragraph position="1"> In the more subtle case, the entity was built from a stretch of discourse longer than one utterance. 
In an effort to achieve an acceptable level of inter-annotator agreement for the annotation, the maximum size for a constituent to serve as an antecedent was defined to be one utterance. Discourse entities that are built from longer stretches of text include objects such as the entire plan or the discourse itself, and such items are less reliable to annotate.</Paragraph> <Paragraph position="2"> Taking the annotated dialogs as a whole, 21.4% of all pronouns have a non-NP antecedent, and 27% do not have an annotatable antecedent at all. Table 2 shows that the default antecedents of personal and demonstrative pronouns follow the predictions of Schiffman (1985).</Paragraph> <Paragraph position="3"> The antecedent of personal pronouns is most likely itself to be a pronoun or a full NP, while demonstratives are most likely to have no antecedent, or if there is one, it is most likely to be a non-NP. The main role of prosodic information is to help pronoun resolution algorithms identify cases where those default predictions are false.</Paragraph> </Section> <Section position="6" start_page="919" end_page="921" type="metho"> <SectionTitle> 4 Acoustic Prosodic Cues </SectionTitle> <Paragraph position="0"> Our selection of acoustic measures covers three classic components of prosody: fundamental frequency (F0), duration, and intensity (Lehiste, 1970). The relationship between those cues and prosodic prominence has been demonstrated by e.g. (Fant and Kruckenberg, 1989; Heuft, 1999). The main correlate of English stress is F0, the second most important is duration, and the least important is intensity (Lehiste, 1970). Therefore, we will pay more attention to F0 measures. Although experimental results indicate that F0 cues of prominence can depend on the shape of the F0 contour of the utterance (cf. (Gussenhoven et al., 1997)), we do not control for such interactions. 
Instead, we restrict ourselves to cues that are easy to compute from limited data, so that a running spoken dialogue system might be able to compute them in real time.</Paragraph> <Section position="1" start_page="919" end_page="921" type="sub_section"> <SectionTitle> 4.1 Acoustic Measures </SectionTitle> <Paragraph position="0"> Duration: For duration, we found that the logarithmic duration values are normally distributed, both pooled over all speakers and for those speakers with more than 20 pronouns. Logarithmic duration is also the target variable of many duration models such as that of (van Santen, 1992). We assume that speaker-related variation is covered by the variance of this normal distribution; we can control for speaker effects by including a SPEAKER factor in our models.</Paragraph> <Paragraph position="1"> F0 variables: F0 was computed using the Entropic ESPS Waves tool get_f0 with standard settings and a frame rate of 10 ms. All F0 values were transformed into the log-domain and then pooled into mean, minimum, and maximum F0 values for each word and each utterance. This log domain is well motivated psychoacoustically (Zwicker and Fastl, 1990). F0 range was computed on the values in the log-domain. We assume that the logarithm of F0 has a normal distribution. Therefore, we can normalize for speaker-dependent differences in pitch range by using z-scores, and we can use standard statistical analysis methods such as ANOVA.</Paragraph> <Paragraph position="2"> Intensity: Intensity is measured as the root-mean-square (RMS) of signal amplitudes. We measure RMS relative to a baseline as given by the formula log(RMS / RMS_baseline). The baseline RMS was computed on the basis of a simple pause detection algorithm, which takes the first maximum in the amplitude histogram to be the average amplitude of background noise. 
The baseline RMS was slightly above that value.</Paragraph> </Section> <Section position="2" start_page="921" end_page="921" type="sub_section"> <SectionTitle> 4.2 Inter-Speaker Differences </SectionTitle> <Paragraph position="0"> Since we need to pool data from many different speakers, we need to control for inter-speaker differences.</Paragraph> <Paragraph position="1"> The number of pronouns we have from each speaker varies between 1 for speaker GD and 86 for speaker CK. Speakers PH, male, and CK, female, are the only ones to have produced more than 15 personal pronouns and 15 demonstratives. In order to test whether the SPEAKER factor affects the choice between personal pronouns and demonstratives, we fitted a logistic regression model with the target variable PRONTYPE (personal or demonstrative) and the predictors ANTE, ANTESUBJ, DIST, REFCAT, CB and SPEAKER (in this sequence). REFCAT is an additional variable that describes the semantic category of a pronoun's referent (e.g. domain objects vs. abstract entities). Even though SPEAKER is the last factor in the model, an analysis of deviance shows a significant influence (p&lt;0.005, F=2.51, df=13). A possible explanation for this is that some speakers prefer to use demonstratives in contexts where others would choose a personal pronoun, and vice versa, or perhaps the SPEAKER variable mediates the influence of a far more complex factor such as problem solving strategy. Resolving this question is beyond the scope of this paper.</Paragraph> <Paragraph position="2"> On the basis of F0, we can establish four groups of speakers: The first group consists of male speakers with a low mean F0 and a low F0 range. In the next group, we find both male and female speakers with a low mean F0, but a far higher range. Speaker PH belongs to this second group. Interestingly, for these speakers, the mean F0 on pronouns is lower than for those of the first group. 
Groups 3 and 4 consist entirely of female speakers, with group 3 using a lower range than group 4. Speaker CK belongs to group 4.</Paragraph> </Section> </Section> <Section position="7" start_page="921" end_page="923" type="metho"> <SectionTitle> 5 Exploring Prominent Pronouns </SectionTitle> <Paragraph position="0"> If data about prosodic prominence is to be useful for pronoun resolution, then there must be prosodic cues that carry information about properties of the antecedent. In this section, we investigate if there are such cues for the properties that we have available in the annotation data, defined in Table 1. More specifically, we hypothesize that prosodic cues will be used if the antecedent is somewhat unusual. For example, the results of Linde and Passonneau would lead us to expect that personal pronouns with non-NP antecedents and demonstratives with NP and pronoun antecedents will be marked. Since the antecedents of pronouns tend to occur no more than 1-2 clauses ago, we would also expect pronouns with more remote antecedents to be marked. A first qualitative look at the data suggests that even if such tendencies are present in the data, they might not turn out to be significant. For example, in Figure 1, the means of lzmeanf0 behave roughly as predicted, but the variation is so large that these differences might well be due to chance. [Table caption, start missing: ...ties (p &lt;0.05) on Prosodic Cues. mean=z-score mean F0, range=range of z-score F0, dur=logarithmic duration, dem=demonstratives, pers=personal pronouns]</Paragraph> <Section position="1" start_page="921" end_page="922" type="sub_section"> <SectionTitle> 5.1 Correlations between Measures and Properties </SectionTitle> <Paragraph position="0"> Next, we examine whether the measures defined in Section 4 correlate with any particular properties of the antecedent. 
More precisely, if a property is cued by some aspect of prosody (either duration, F0, or intensity), then the prosody of a pronoun depends to a certain degree on its antecedent. In a statistical analysis, we should find a significant effect of the relevant antecedent property on the prosodic measure. We selected ANOVA as our analysis method, because our prosodic target variables appear to have a normal distribution. For each of the antecedent features defined above, we examined its influence on mean F0 (imeanf0), the z-score of mean F0 (lzmeanf0), the z-score of F0 range (lzrgf0), logarithmic duration (dur), and normalized energy (energy). In addition, we added the factors PRONTYPE and SPEAKER.</Paragraph> <Paragraph position="1"> Results: The results are summarized in Table 3. For izmeanf0 and energy, the influence of SPEAKER is always considerable. There are also consistent effects of the syntactic position of a pronoun: In general, demonstratives are shorter in subject position, and for CK, mean F0 on personal pronouns in subject position is higher than on non-subject ones (228 Hz vs. 190 Hz).</Paragraph> <Paragraph position="2"> But when we turn to the factors that interest us most, properties of the antecedent, we cannot find any consistent correlates, although in almost every data set, there are some prosodic cues to ANTESUBJ for personal pronouns. But what these cues are may well depend on the speaker, as the results for CK show. Her pitch range on pronouns with a subject antecedent is double the range on pronouns with an antecedent in non-subject position.</Paragraph> <Paragraph position="4"> [...] account for a very small percentage of the variation in these prosodic cues. 
Therefore, we should not expect the prosodic cues to be stable, robust indicators for predicting antecedent properties in spoken dialog systems.</Paragraph> </Section> <Section position="2" start_page="922" end_page="922" type="sub_section"> <SectionTitle> 5.2 Inter-Speaker Variation </SectionTitle> <Paragraph position="0"> We have seen that inter-speaker differences explain much of the variation in the prosodic measures. Table 4 gives an idea of the size and direction of these differences.</Paragraph> <Paragraph position="1"> On the complete data set, we find that personal pronouns are shorter than demonstratives, they have a lower intensity and show a higher average F0 (Table 4). A closer examination reveals considerable inter-speaker variation in the data, illustrated in Table 4. CK is fairly prototypical. PH barely shows the difference in F0, and for MF, the difference in intensity is actually reversed.</Paragraph> <Paragraph position="2"> MF also has rather short demonstratives. Such speaker-specific variation cannot be eliminated by normalization. It has to be controlled for in the statistical tests. Discovering types of speakers is difficult: two of the 15 speakers, CK and PH, contribute 48% of all pronouns.</Paragraph> </Section> <Section position="3" start_page="922" end_page="923" type="sub_section"> <SectionTitle> 5.3 Predicting Properties of the Antecedent </SectionTitle> <Paragraph position="0"> Finally, we examine how much information prosodic cues yield about the antecedent. For this purpose, we set up a prediction task not unlike one that an actual NLU system faces. The input variables are the prosodic properties of the pronoun, whether the pronoun is personal or demonstrative (PRONTYPE), whether it is the subject (PRONSUBJ), and whether it is sentence-initial (PRONINIT). From this, we now have to deduce properties of the antecedent: syntactic role (ANTESUBJ), form (ANTEFORM), and distance (DIST). 
For prediction, we used logistic regression (Agresti, 1990). This has two advantages: not only can we compare how well the different regression models fit the data, we can also re-analyze the fitted model to determine which factors have a significant influence on classification accuracy.</Paragraph> <Paragraph position="1"> First, we construct a model on the basis of PRONTYPE, PRONSUBJ, and PRONINIT. Then, we construct a model with these three factors plus SPEAKER. Finally, we train a model with PRONTYPE, PRONSUBJ, PRONINIT, SPEAKER and one of the three measures lzmeanf0, dur, energy. The models are trained to predict whether there is an antecedent (task noAnte), whether the antecedent is a non-NP (task nonNP), whether the antecedent is remote (task remote), whether the antecedent is in subject position (task subjAnte), and whether the antecedent is the current Cb (task cb). All models are computed over the full data set, because the data set for speaker CK is not sufficient for estimating the regression coefficients. The models are then compared to see which step yielded a significant improvement: adding SPEAKER or adding the prosodic variable after we have accounted for SPEAKER variation. [Table caption, start missing: ... 45 demonstrative, PH: 18 personal, 24 demonstrative, MF: 7 personal, 8 demonstrative]</Paragraph> <Paragraph position="2"> Results: The results are summarized in Table 5. On all tasks except remote, PRONTYPE and PRONSUBJ performed well. Both features have already been shown to be reliable cues for pronoun resolution (cf. Section 2). On task cb, only PRONTYPE can explain a significant amount of variation. Models which include a speaker factor almost always fare better. In models without speaker information, F0-related measures yield a larger reduction in deviance than the duration measure.</Paragraph> <Paragraph position="3"> The reason for this is that the F0 measures preserve some information about the different speaker strategies. 
Once SPEAKER has been included as well, only dur leads to significant improvements on task nonNP (p&lt;0.05).</Paragraph> <Paragraph position="4"> Both demonstratives and personal pronouns are shorter when the antecedent is a non-NP.</Paragraph> </Section> </Section> </Paper>