<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0320"> <Title>An Empirical Approach to Temporal Reference Resolution</Title> <Section position="4" start_page="174" end_page="174" type="metho"> <SectionTitle> 2 From two to four 3 Or two thirty to four thirty 4 Or three to five </SectionTitle> <Paragraph position="0"> s2 5 Then how does from two thirty to four thirty seem to you</Paragraph> </Section> <Section position="5" start_page="174" end_page="176" type="metho"> <SectionTitle> 6 On Thursday </SectionTitle> <Paragraph position="0"> An example of temporal reference resolution is that (2) refers to 2-4pm Thursday 19 August. Although related, this problem is distinct from tense and aspect interpretation in discourse (as addressed in, e.g., Webber 1988, Song & Cohen 1991, Hwang & Schubert 1992, Lascarides et al. 1992, and Kameyama et al. 1993).</Paragraph> <Paragraph position="1"> Because the dialogs are centrally concerned with negotiating an interval of time in which to hold a meeting, our representations are geared toward such intervals. Our basic representational unit is given in figure 1. To avoid confusion, we refer to this basic unit throughout as a Temporal Unit (TU).</Paragraph> <Paragraph position="2"> The time referred to in, for example, &quot;From 2 to 4, on Wednesday the 19th of August&quot; is represented as: ((August, 19th, Wednesday, 2, pm) (August, 19th, Wednesday, 4, pm)) Thus, the information from multiple noun phrases is often merged into a single representation of the underlying interval evoked by the utterance.</Paragraph> <Paragraph position="3"> An utterance such as &quot;The meeting starts at 2&quot; is represented as an interval rather than as a point in time, reflecting the orientation of the coding scheme toward intervals. Another issue this kind of utterance raises is whether or not a speculated ending time of the interval should be filled in, using knowledge of how long meetings usually last. In the CMU data, the meetings all last two hours. However, so that the instructions will be applicable to a wider class of dialogs, we decided to be conservative with respect to filling in an ending time, given the starting time (or vice versa), leaving it open unless something in the dialog explicitly suggests otherwise.</Paragraph> <Paragraph position="4"> There are cases in which times are considered as points (e.g., &quot;It is now 3pm&quot;). These are represented as Temporal Units with the same starting and ending times (as in Allen (1984)). If just one ending point is represented, all the fields of the other are null. And, of course, all fields are null for utterances that do not contain temporal information. In the case of an utterance that refers to multiple, distinct intervals, the representation is a list of Temporal Units.</Paragraph> <Paragraph position="5"> A Temporal Unit is also the representation used in the evaluation of the system. That is, the system's answers are mapped from its more complex internal representation (an ILT, see section 4.1) into this simpler vector representation before evaluation is performed.</Paragraph>
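To make the Temporal Unit representation concrete, here is a minimal sketch in Python; the class and field names are illustrative, not the paper's actual encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative names; the coding scheme is a flat five-field vector per
# endpoint: (month, date, day of week, hour, am/pm).

@dataclass
class Endpoint:
    month: Optional[str] = None    # e.g. "August"
    date: Optional[int] = None     # e.g. 19
    weekday: Optional[str] = None  # e.g. "Wednesday"
    hour: Optional[int] = None     # e.g. 2
    ampm: Optional[str] = None     # "am" or "pm"

@dataclass
class TemporalUnit:
    start: Endpoint
    end: Endpoint

# "From 2 to 4, on Wednesday the 19th of August":
tu = TemporalUnit(
    start=Endpoint("August", 19, "Wednesday", 2, "pm"),
    end=Endpoint("August", 19, "Wednesday", 4, "pm"),
)

# A point in time ("It is now 3pm") gets identical start and end;
# an utterance with no temporal information leaves every field None.
```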
<Paragraph position="6"> As in much recent empirical work in discourse processing (e.g., Ahrenberg et al. 1995; Isard & Carletta 1995; Litman & Passonneau 1995; Moser & Moore 1995; Hirschberg & Nakatani 1996), we performed an intercoder reliability study investigating agreement in annotating the times. The goal in developing the annotation instructions is that they can be used reliably by non-experts after a reasonable amount of training (cf. Passonneau & Litman 1993, Condon & Cech 1995, and Hirschberg & Nakatani 1996), where reliability is measured in terms of the amount of agreement among annotators. High reliability indicates that the encoding scheme is reproducible given multiple labelers. In addition, the instructions serve to document the annotations.</Paragraph> <Paragraph position="7"> The subjects were three people with no previous involvement in the project. They were given the original Spanish and the English translations. However, as they have limited knowledge of Spanish, in essence they annotated the English translations.</Paragraph> <Paragraph position="8"> The subjects annotated two training dialogs according to the instructions. After receiving feedback, they annotated four unseen test dialogs. Inter-coder reliability was assessed using Cohen's Kappa statistic (κ) (Siegel & Castellan 1988, Carletta 1996).</Paragraph> <Paragraph position="9"> κ is calculated as follows, where the numerator is the average percentage agreement among the annotators (Pa) less a term for chance agreement (Pe), and the denominator is 100% agreement less the same term for chance agreement (Pe):</Paragraph> <Paragraph position="10"> κ = (Pa - Pe) / (1 - Pe)</Paragraph> <Paragraph position="11"> (For details on calculating Pa and Pe, see Siegel & Castellan 1988.) As discussed in Hays (1988), κ will be 0.0 when the agreement is what one would expect under independence, and it will be 1.0 when the agreement is exact. A κ value of 0.8 or greater indicates a high level of reliability among raters, with values between 0.67 and 0.8 indicating only moderate agreement (Hirschberg & Nakatani 1996; Carletta 1996).</Paragraph>
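A minimal sketch of this computation, with Pa and Pe expressed as proportions rather than percentages:

```python
def kappa(p_a: float, p_e: float) -> float:
    """Cohen's kappa: observed agreement p_a corrected for the
    agreement p_e expected by chance."""
    return (p_a - p_e) / (1.0 - p_e)

# kappa(0.9, 0.5) == 0.8  -> high reliability by the threshold above
# kappa(0.5, 0.5) == 0.0  -> agreement no better than chance
```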
<Paragraph position="12"> In addition to measuring intercoder reliability, we compared each coder's annotations to the evaluation Temporal Units used to assess the system's performance. These evaluation Temporal Units were assigned by an expert working on the project.</Paragraph> <Paragraph position="13"> The agreement among coders (κ) is shown in table 1. In addition, this table shows the average pairwise agreement of the coders and the expert (κ_avg), which was assessed by averaging the individual κ scores (not shown). There is a moderate or high level of agreement among annotators in all cases except the ending time of day, a weakness we are investigating. Similarly, there are reasonable levels of agreement between our evaluation Temporal Units and the answers the naive coders provided.</Paragraph> <Paragraph position="14"> Busemann et al. (1997) also annotate temporal information in a corpus of scheduling dialogs. However, their annotations are at the level of individual expressions rather than at the level of Temporal Units, and they do not present the results of an intercoder reliability study.</Paragraph> </Section> <Section position="6" start_page="176" end_page="176" type="metho"> <SectionTitle> 3 Model </SectionTitle> <Paragraph position="0"> This section presents our model of temporal reference in scheduling dialogs. The treatment of anaphora in this paper is as a relationship between a Temporal Unit representing a time evoked in the current utterance, and one representing a time evoked in a previous utterance. The resolution of the anaphor is a new Temporal Unit that represents the interpretation of the contributing words of the current utterance.</Paragraph> <Paragraph position="1"> Fields of Temporal Units are partially ordered as in figure 2, from least to most specific.</Paragraph> <Paragraph position="2"> In all cases below, after the resolvent has been formed, it is subjected to highly accurate, trivial inference to produce the final interpretation (e.g., filling in the day of the week given the month and the date).</Paragraph> <Paragraph position="3"> The cases of non-anaphoric reference: 1. A deictic expression is resolved into a time interpreted with respect to the dialog date (e.g., &quot;Tomorrow&quot;, &quot;last week&quot;). (See rule NA1 in section 4.2.) 2. A forward time is calculated by using the dialog date as a frame of reference.</Paragraph> <Paragraph position="4"> Let F be the most specific field in TU_current above the level of time-of-day.</Paragraph> <Paragraph position="5"> Resolvent: The next F after the dialog date, augmented with the fillers of the fields in TU_current at or below the level of time-of-day. (See rule NA2.) For both this and anaphoric relation (3), there are subcases for whether the starting and/or ending times are involved. Note that tense can influence the choice of whether to calculate a forward or a backward time from a frame of reference (Kamp & Reyle 1993), but we do not account for this in our model due to the lack of tense variation in the corpora.</Paragraph> <Paragraph position="6"> Ex: Dialog date is Mon, 19th, Aug. &quot;How about Wednesday at 2?&quot; interpreted as 2 pm, Wed 21 Aug. The cases of anaphora considered: 1. The utterances evoke the same time, or the second is more specific than the first.</Paragraph> <Paragraph position="7"> Resolvent: the union of the information in the two Temporal Units. (See rule A1.) Ex: &quot;How is Tuesday, January 30th?&quot; &quot;How about 2?&quot; (See also (1)-(2) of the corpus example.) 2. The current utterance evokes a time that includes the time evoked by a previous utterance, and the current time is less specific. (See rule A2.) Let F be the most specific field in TU_current. Resolvent: All of the information in TU_previous from F on up.</Paragraph> <Paragraph position="8"> Ex: &quot;How about Monday at 2?&quot; resolved to 2pm, Mon 19 Aug. &quot;Ok, well, Monday sounds good.&quot; (See also (5)-(6) in the corpus example.) 3. This is the same as non-anaphoric case (2) above, but the new time is calculated with respect to TU_previous instead of the dialog date. Ex: &quot;How about the 3rd week in August?&quot; &quot;Let's see, Monday sounds good.&quot; interpreted as Mon, 3rd week in Aug. Ex: &quot;Would you like to meet Wed, Aug 2nd?&quot; &quot;No, how about Friday at 2.&quot; interpreted as Fri, Aug 4 at 2pm. 4. The current time is a modification of the previous time; the times are consistent down to some level of specificity X and differ in the filler of X. Resolvent: The information in TU_previous above level X together with the information in TU_current at and below level X. (See rule A4.) Ex: &quot;Monday looks good.&quot; resolved to Mon 19 Aug. &quot;How about 2?&quot; resolved to 2pm Mon 19 Aug. &quot;Hmm, how about 4?&quot; resolved to 4pm Mon 19 Aug. (See also (3)-(5) in the example from the corpus.)
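As a sketch of how anaphoric relation 4 can be computed over the flat field representation: the field ordering below (least to most specific) is our assumption, since figure 2 is not reproduced here, and the names are illustrative.

```python
# Assumed field order, least -> most specific (stands in for figure 2).
FIELDS = ["month", "date", "weekday", "hour", "ampm"]

def resolve_modification(tu_prev: dict, tu_curr: dict) -> dict:
    """Anaphoric relation 4: keep tu_prev's fields above level X and take
    tu_curr's fields at and below X, where X is the most general level
    the current utterance fills."""
    filled = [f for f in FIELDS if tu_curr.get(f) is not None]
    if not filled:
        return dict(tu_prev)
    x = FIELDS.index(filled[0])
    resolvent = {f: tu_prev.get(f) for f in FIELDS[:x]}
    resolvent.update({f: tu_curr.get(f) for f in FIELDS[x:]})
    return resolvent

# "Monday looks good." -> 2pm Mon 19 Aug ... "Hmm, how about 4?"
prev = {"month": "August", "date": 19, "weekday": "Monday",
        "hour": 2, "ampm": "pm"}
curr = {"hour": 4, "ampm": "pm"}
print(resolve_modification(prev, curr))  # 4pm, Monday 19 August
```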
Although we found domain knowledge and task-specific linguistic conventions most useful, we observed in the NMSU data some instances of potentially exploitable syntactic information to pursue in future work (Grosz et al. 1995, Sidner 1979). For example, &quot;until&quot; in the following suggests that the first utterance specifies an ending time.</Paragraph> <Paragraph position="9"> &quot;... could it be until around twelve?&quot; &quot;12:30 there&quot; A preference for parallel syntactic roles might be used to recognize that the second utterance specifies an ending time too.</Paragraph> </Section> <Section position="7" start_page="176" end_page="179" type="metho"> <SectionTitle> 4 The Algorithm </SectionTitle> <Paragraph position="0"> This section presents our algorithm for temporal reference resolution. After a brief overview, the rule-application architecture is described and then the rules composing the algorithm are given.</Paragraph> <Paragraph position="1"> As mentioned earlier, this is a high-level algorithm. A description of the complete algorithm, including a specification of the normalized input representation (see section 4.1), can be obtained from a report available at the project web page (http://crl.nmsu.edu/Research/Projects/artwork).</Paragraph> <Paragraph position="2"> There is a rule for each of the relations presented in section 3. Those for the anaphoric relations involve various applicability conditions on the current utterance and a potential antecedent. For the current not-yet-resolved Temporal Unit, each rule is applied. For the anaphoric rules, the antecedent considered is the most recent one meeting the conditions. All consistent maximal mergings of the results are formed, and the one with the highest score is the chosen interpretation.</Paragraph> <Section position="1" start_page="176" end_page="178" type="sub_section"> <SectionTitle> 4.1 Architecture </SectionTitle> <Paragraph position="0"> Following (Qu et al. 1996) and (Suhm et al. 1994), the representation of a single utterance is called an ILT (for InterLingual Text). An ILT, once it has been augmented by our system with temporal (and speech-act) information, is called an augmented ILT (an AILT). The input to our system, produced by a semantic parser (Suhm et al. 1994; Lavie & Tomita 1993), consists of multiple alternative ILT representations of utterances. To produce one ILT, the parser maps the main event and its participants into one of a small set of case frames (for example, a meet frame or an is busy frame) and produces a surface representation of any temporal information, which is faithful to the input utterance. Although the events and states discussed in the NMSU data are often outside the coverage of this parser, the temporal information generally is not. Thus, the parser provides us with a sufficient input representation for our purposes on both sets of data. This parser is proprietary, but it would not be difficult to produce just the portion of the temporal information that our system requires.</Paragraph> <Paragraph position="1"> Because the input consists of alternative sequences of ILTs, the system resolves the ambiguity in batches. In particular, for each input sequence of ILTs, it produces a sequence of AILTs and then chooses the best sequence for the corresponding utterances. In this way, the input ambiguity is resolved as a function of finding the best temporal interpretations of the utterance sequences in context (as suggested in Qu et al. 1996).</Paragraph>
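A sketch of this batch disambiguation, under the assumption (ours, not stated in the paper) that a sequence's score is the sum of its AILTs' certainty factors; resolve_in_context stands in for the rule application of section 4.2.

```python
from typing import Callable, List, Tuple

def best_ailt_sequence(
    alternative_ilt_seqs: List[List[dict]],
    resolve_in_context: Callable[[dict, List[dict]], Tuple[dict, float]],
) -> List[dict]:
    """Resolve each candidate ILT sequence in context and keep the one
    whose temporal interpretations score highest overall."""
    best_score, best_seq = float("-inf"), []
    for ilt_seq in alternative_ilt_seqs:
        focus_list: List[dict] = []   # most recent discourse entity first
        ailts, score = [], 0.0
        for ilt in ilt_seq:
            ailt, cf = resolve_in_context(ilt, focus_list)
            ailts.append(ailt)
            focus_list.insert(0, ailt)
            score += cf
        if score > best_score:
            best_score, best_seq = score, ailts
    return best_seq
```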
<Paragraph position="2"> A focus list keeps track of what has been discussed so far in the dialog. After a final AILT has been created for the current utterance, the AILT and the utterance are placed together on the focus list (where they are now referred to as a discourse entity, or DE). In the case of utterances that evoke more than one Temporal Unit, a separate entity is added for each to the focus list in order of mention.</Paragraph> <Paragraph position="3"> Otherwise, the system architecture is similar to a standard production system, with one major exception: rather than choosing the results of just one of the rules that fires (i.e., conflict resolution), multiple results can be merged. This is a flexible architecture that accommodates sets of rules targeting different aspects of interpretation, allowing the system to take advantage of constraints that exist between them (for example, temporal and speech act rules).</Paragraph> <Paragraph position="4"> Step 1. The input ILT is normalized. In the input ILT, different pieces of information about the same time might be represented separately in order to capture relationships among clauses. Our system needs to know which pieces of information are about the same time (but does not need to know about the additional relationships). Thus, we map from the input representation into a normalized form that shields the reasoning component from the idiosyncrasies of the input representation. After the normalization process, highly accurate, obvious inferences are made and added to the representation.</Paragraph> <Paragraph position="5"> Step 2. All rules are applied to the normalized input. The result of a rule application is a partial AILT (PAILT)--information this rule would contribute to the interpretation of the utterance. This information includes a certainty factor representing an a priori preference for the type of anaphoric or non-anaphoric relation being established. In the case of anaphoric relations, this factor gets adjusted by a term representing how far back on the focus list the antecedent is (in rules A1-A4 in section 4.2, the adjustment is represented by distance_factor in the calculation of the certainty factor CF). The result of this step is the set of PAILTs produced by the rules that fired (i.e., those that succeeded).</Paragraph> <Paragraph position="6"> Step 3. All maximal mergings of the PAILTs are created. Consider a graph in which the PAILTs are the vertices, and there is an edge between two PAILTs iff the two PAILTs are compatible. Then, the maximal cliques of the graph (i.e., the maximal complete subgraphs) correspond to the maximal mergings. Each maximal merging is then merged with the normalized input ILT, resulting in a set of AILTs.</Paragraph> <Paragraph position="7"> Step 4. The AILT chosen is the one with the highest certainty factor. The certainty factor of an AILT is calculated as follows. First, the certainty factors of the constituent PAILTs are summed. Then, critics are applied to the resulting AILT, lowering the certainty factor if the information is judged to be incompatible with the dialog state.</Paragraph>
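Step 3 can be realized by brute-force clique enumeration, which is adequate for the handful of PAILTs produced per utterance; a sketch, with compatible() assumed rather than taken from the paper:

```python
from itertools import combinations

def maximal_mergings(pailts: list, compatible) -> list:
    """Vertices are PAILTs, edges join compatible pairs; the maximal
    cliques are the maximal mergings. Enumerating subsets largest-first
    makes maximality 'not contained in a clique already found'."""
    cliques: list = []
    n = len(pailts)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            is_clique = all(compatible(pailts[i], pailts[j])
                            for i, j in combinations(subset, 2))
            if is_clique and not any(set(subset) <= c for c in cliques):
                cliques.append(set(subset))
    return [[pailts[i] for i in sorted(c)] for c in cliques]
```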
<Paragraph position="8"> The merging process might have yielded additional opportunity for making obvious inferences, so that process is performed again, to produce the final AILT.</Paragraph> </Section> <Section position="2" start_page="178" end_page="179" type="sub_section"> <SectionTitle> 4.2 Temporal Resolution Rules </SectionTitle> <Paragraph position="0"> The rules described in this section (see figure 3) apply to individual temporal units and return either a more fully specified TU or an empty structure to indicate failure.</Paragraph> <Paragraph position="1"> Many of the rules calculate temporal information with respect to a frame of reference, using a separate calendar utility. The following describe these and other functions assumed by the rules below, as well as some conventions used.</Paragraph> <Paragraph position="2"> next(TimeValue, RF): returns the next TimeValue that follows reference frame RF.</Paragraph> <Paragraph position="3"> next(Monday, [...Friday, 19th,...]) = Monday, 22nd.</Paragraph> <Paragraph position="4"> resolve_deictic(DT, RF): resolves the deictic term DT with respect to the reference frame RF.</Paragraph> <Paragraph position="5"> merge(TU1, TU2): if temporal units TU1 and TU2 contain no conflicting field fillers, returns a temporal unit containing all of the information in the two; otherwise returns {}.</Paragraph> <Paragraph position="6"> merge_upper(TU1, TU2): like the previous function, except includes only those field fillers from TU1 that are of the same or less specificity as the most specific field filler in TU2.</Paragraph> <Paragraph position="7"> specificity(TU): returns the specificity of the most specific field in TU.</Paragraph> <Paragraph position="8"> starting_fields(TU): returns a list of starting field names for those in TU having non-null values.</Paragraph> <Paragraph position="9"> structure→component: returns the named component of the structure.</Paragraph> <Paragraph position="10"> Conventions: Values are in bold face and variables are in italics. TU is the current temporal unit being resolved. TodaysDate is a representation of the dialog date. FocusList is the list of discourse entities from all previous utterances. The algorithm does not cover a number of sub-cases of relations concerning the ending times. For instance, rule NA2 covers only the starting-time case of non-anaphoric relation 2. An example of an ending-time case that is not handled is the utterance &quot;Let's meet until Thursday,&quot; under the meaning that they should meet from today through Thursday. This is an area for future work.</Paragraph> </Section> </Section>
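The merge, merge_upper, and specificity functions can be sketched directly over the flat field representation, again assuming the illustrative least-to-most-specific field order used earlier:

```python
FIELDS = ["month", "date", "weekday", "hour", "ampm"]  # assumed order

def merge(tu1: dict, tu2: dict) -> dict:
    """Union of the two TUs; {} on any conflicting field filler."""
    out = {f: tu1.get(f) for f in FIELDS}
    for f in FIELDS:
        v = tu2.get(f)
        if v is None:
            continue
        if out[f] is not None and out[f] != v:
            return {}
        out[f] = v
    return out

def specificity(tu: dict) -> int:
    """Index of the most specific non-null field; -1 if all null."""
    return max((i for i, f in enumerate(FIELDS) if tu.get(f) is not None),
               default=-1)

def merge_upper(tu1: dict, tu2: dict) -> dict:
    """Like merge, but keep only tu1's fields that are no more specific
    than tu2's most specific filled level."""
    cutoff = specificity(tu2)
    upper = {f: tu1.get(f) for f in FIELDS[:cutoff + 1]}
    return merge(upper, tu2)
```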
<Section position="8" start_page="179" end_page="182" type="metho"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> As mentioned in section 2, the main results are based on comparisons against human annotation of the held-out test data. The results are based on straight field-by-field comparisons of the Temporal Unit representations introduced in section 2. Thus, to be considered correct, information must not only be right, but it has to be in the right place. Thus, for example, &quot;Monday&quot; correctly resolved to Monday, 19th of August, but incorrectly treated as a starting rather than an ending time, contributes 3 errors of omission and 3 errors of commission (and no credit is given for recognizing the date).</Paragraph> <Paragraph position="1"> Detailed results for the test sets are presented next, starting with results for the CMU data (see table 2). Accuracy measures the degree to which the system produces the correct answers, while precision measures the degree to which the system's answers are correct (see the formulas in the tables). For each component of the extracted temporal structure, counts were maintained for the number of correct and incorrect cases of the system versus the tagged file. Since null values occur quite often, these two counts exclude cases when one or both of the values are null. Instead, additional counts were used for those possibilities. Note that each test set contains three complete dialogs with an average of 72 utterances per dialog.</Paragraph> <Paragraph position="2"> These results show that the system is performing with 81% accuracy overall, which is significantly better than the lower bound (defined below) of 43%. In addition, the results show a high precision of 92%.</Paragraph> <Paragraph position="3"> In some of the individual cases, however, the results could be higher due to several factors. For example, our system development was inevitably focused more on some types of slots than others. An obvious area for improvement is the time-of-day handling.</Paragraph> <Paragraph position="4"> Also, note that the values in the Missing column are higher than those in the Extra column. This reflects the conservative coding convention, mentioned in section 2, for filling in unspecified end points.</Paragraph> <Paragraph position="5"> A system that produces extraneous values is more problematic than one that leaves entries unspecified.</Paragraph> <Paragraph position="6"> Table 3 contains the results for the system on the NMSU data. This shows that the system performs respectably, with 69% accuracy and 88% precision, on this less constrained set of data. The precision is still comparable, but the accuracy is lower since more of the entries were left unspecified. Furthermore, the lower bound for accuracy (29%) is almost 15 percentage points lower than the one for the CMU data (43%), supporting the claim that this data set is more challenging. More details on the lower bounds for the test data sets are shown next (see table 4). These values were derived by disabling all the rules and just evaluating the input as is (after performing normalization, so the evaluation software could be applied). Since 'null' is the most frequent value for all the fields, this is equivalent to using a naive algorithm that selects the most frequent value for a given field. The right-most column shows that there is a small amount of error in the input representation. This figure is 1 minus the precision of the input representation (after normalization). Note, however, that this is a close but not entirely direct measure of the error in the input, because there are a few cases of the normalization process committing errors and a few of it correcting them. Recall that the input is ambiguous; the figures in table 4 are based on the system selecting the first ILT in each case. Since the parser orders the ILTs based on a measure of acceptability, this choice is likely to contain the relevant temporal information.</Paragraph>
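One plausible reading of the scoring, given the five counts just described; the paper's exact formulas appear in its tables, so treat this as an approximation rather than the published definitions.

```python
def scores(correct, incorrect, missing, extra, both_null):
    """Accuracy: share of key values the system matched (counting agreed
    nulls). Precision: share of the system's non-null answers that were
    right. Lower bound: the all-null baseline, since 'null' is the most
    frequent value for every field."""
    total = correct + incorrect + missing + extra + both_null
    accuracy = (correct + both_null) / total
    precision = correct / (correct + incorrect + extra)
    # The all-null baseline agrees exactly on the key's null entries
    # (both_null plus the cases where only the system was non-null).
    lower_bound = (both_null + extra) / total
    return accuracy, precision, lower_bound
```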
<Paragraph position="7"> Since the above results are for the system taking ambiguous semantic representations as input, the evaluation does not isolate focus-related errors.</Paragraph> <Paragraph position="8"> Therefore, two tasks were performed to aid in developing the analysis presented in section 6. First, anaphoric chains and competing discourse entities were manually annotated in all of the seen data.</Paragraph> <Paragraph position="9"> Second, to aid in isolating errors due to focus issues, the system was evaluated on unambiguous, partially corrected input for all the seen data (the test sets were retained as unseen test data).</Paragraph> <Paragraph position="10"> The overall results are shown in table 5. This includes the results described earlier to facilitate comparisons. Among the first, more constrained data, there are twelve dialogs in the training data and three dialogs in a held-out test set. The average length of each dialog is approximately 65 utterances.</Paragraph> <Paragraph position="11"> Among the second, less constrained data, there are four training dialogs and three test dialogs.</Paragraph> [Figure 3: The temporal resolution rules. Rules for non-anaphoric relations. Rule NA1: All cases of non-anaphoric relation 1. If there is a deictic term, DT, in TU, then return {[when, resolve_deictic(DT, TodaysDate)], [certainty, 0.9]}. Rule NA2: The starting-time cases of non-anaphoric relation 2. If most_specific(starting_fields(TU)) < time_of_day, then let f be the most specific field in starting_fields(TU) and return {[when, next(TU→f, TodaysDate)], [certainty, 0.4]}. Rules for anaphoric relations. Rule A1: All cases of anaphoric relation 1. For each non-empty temporal unit TU_fl from FocusList (starting with the most recent): if specificity(TU_fl) < specificity(TU) and not empty merge(TU_fl, TU), then return {[when, merge(TU_fl, TU)], [certainty, CF]}. Rule A2: All cases of anaphoric relation 2. For each non-empty temporal unit TU_fl from FocusList (starting with the most recent): if specificity(TU_fl) > specificity(TU) and not empty merge_upper(TU_fl, TU), then return {[when, merge_upper(TU_fl, TU)], [certainty, CF]}. Rule A3: Starting-time case of anaphoric relation 3. If most_specific(starting_fields(TU)) < time_of_day, then for each non-empty temporal unit TU_fl from FocusList (starting with the most recent): if specificity(TU) > specificity(TU_fl), then let f be the most specific field in starting_fields(TU) [remainder of rule not recovered]. In rules A1-A4, CF is the rule's a priori certainty adjusted by the distance_factor term of section 4.1.] [Table legend: System and key agree on non-null value; System and key differ on non-null value; System has null value for non-null key (Missing); System has non-null value for null key (Extra); Both system and key give null answer; accuracy = percentage of key values matched correctly; lower bound.] <Paragraph position="12"> As described in the next section, our approach handles focus effectively. In both data sets, there are noticeable gains in performance on the seen data going from ambiguous to unambiguous input, especially for the NMSU data. Therefore, the ambiguity in the dialogs contributes much to the errors. The better performance on the unseen, ambiguous NMSU data over the seen, ambiguous NMSU data is due to several reasons.
For instance, there is vast ambiguity in the seen data. Also, numbers are mistaken by the input parser for dates (e.g., phone numbers are treated as dates). In addition, a tense filter, to be discussed below in section 6, was implemented to heuristically detect subdialogs, improving the performance on the seen NMSU ambiguous dialogs. This filter did not, however, significantly improve the performance for any of the other data, suggesting that the targeted kinds of subdialogs do not occur in the unseen data.</Paragraph> <Paragraph position="23"> The errors remaining in the seen, unambiguous NMSU data are overwhelmingly due to parser error, errors in applying the rules, errors in mistaking anaphoric references for deictic references (and vice versa), and errors in choosing the wrong anaphoric relation. As will be shown in the next section, very few errors can be attributed to the wrong entities being in focus due to not handling subdialogs or &quot;multiple threads&quot; (Rosé et al. 1995).</Paragraph> </Section> <Section position="9" start_page="182" end_page="183" type="metho"> <SectionTitle> 6 Global Focus </SectionTitle> <Paragraph position="0"> The algorithm is conspicuously lacking in any mechanism for recognizing the global structure of the discourse, such as in Grosz & Sidner (1986), Mann & Thompson (1988), Allen & Perrault (1980), and their descendants. Recently in the literature, Walker (1996) has argued for a more linear, recency-based model of Attentional State (though not that discourse structure need not be recognized), while Rosé et al. (1995) argue for a more complex model of Attentional State than is represented in most current computational theories of discourse.</Paragraph> <Paragraph position="1"> Many theories that address how Attentional State should be modeled have the goal of performing intention recognition as well. We investigate performing temporal reference resolution directly, without also attempting to recognize discourse structure or intentions. We assess the challenges the data present to our model when only this task is attempted.</Paragraph> <Paragraph position="2"> We identified how far back on the focus list one must go to find an antecedent that is appropriate according to the model. Such an antecedent need not be unique. (We also allow antecedents for which the anaphoric relation would be a trivial extension of one of the relations in the model.) The results are striking. Between the two sets of data, out of 215 anaphoric references, there are fewer than 5% for which the immediately preceding time is not an appropriate antecedent. Going back an additional time covers the remaining cases.</Paragraph> <Paragraph position="3"> The model is geared toward allowing the most recent Temporal Unit to be an appropriate antecedent.</Paragraph> <Paragraph position="4"> For example, in the example for anaphoric relation 4, the second utterance (as well as the first) is a possible antecedent of the third. A corresponding speech act analysis might be that the speaker is suggesting a modification of a previous suggestion. Considering the most recent antecedent as often as possible supports robustness, in the sense that more of the dialog is considered.</Paragraph> <Paragraph position="5"> There are subdialogs in the NMSU data (but none in the CMU data) for which our recency algorithm fails because it lacks a mechanism for recognizing subdialogs.
There are five temporal references within subdialogs that recency either incorrectly interprets to be anaphoric to a time mentioned before the subdialog or incorrectly interprets to be the antecedent of a time mentioned after the subdialog.</Paragraph> <Paragraph position="6"> Fewer than 25 cumulative errors result from these primary areas. In the case of one of the primary errors, recency commits a &quot;self-correcting&quot; error; without this luck, the remainder of the dialog would have represented additional cumulative error.</Paragraph> <Paragraph position="7"> In a departure from the algorithm, the system uses a simple heuristic for ignoring subdialogs: a time is ignored if the utterance evoking it is in the simple past or past perfect. This prevents a number of the above errors and suggests that changes in tense, aspect, and modality are promising clues to explore for recognizing subdialogs in this kind of data (cf., e.g., Grosz & Sidner 1986; Nakhimovsky 1988). The CMU data has very little variation in tense and aspect, which is the reason a mechanism for interpreting them was not incorporated into the algorithm.</Paragraph>
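A minimal sketch of this filter; the tense labels and the ILT field name are assumptions about the parser's output, not the paper's actual interface.

```python
IGNORED_TENSES = {"simple_past", "past_perfect"}

def times_for_focus(ilt: dict) -> list:
    """Heuristic subdialog filter: drop times evoked by utterances in the
    simple past or past perfect, so they neither resolve against the
    focus list nor enter it as antecedents."""
    if ilt.get("tense") in IGNORED_TENSES:
        return []
    return ilt.get("temporal_units", [])
```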
<Paragraph position="8"> Rosé et al. (1995) report that &quot;multiple threads&quot;, when the participants are negotiating separate times, pose challenges to a stack-based discourse model on both the intentional and attentional levels.</Paragraph> <Paragraph position="9"> They posit a more complex representation of Attentional State to meet these challenges. They report improved results on speech-act resolution in a corpus of scheduling dialogs.</Paragraph> <Paragraph position="10"> Here, we focus on just the attentional level. The structure relevant for the task addressed in this paper is the following, corresponding to their figure 2. There are four Temporal Units mentioned in the order TU1, TU2, TU3, TU4 (other times could be mentioned in between). The (attentional) multiple-thread case is when TU1 is required to be an antecedent of TU3, but TU2 is also needed to interpret TU4. Thus, TU2 cannot simply be thrown away or ignored once we are done interpreting TU3. This structure would definitely pose a difficult problem for our algorithm, but there are no realizations, in terms of our model, of this structure in the data we analyzed.</Paragraph> <Paragraph position="11"> The different findings might be due to the fact that different problems are being addressed. Having no intentional state, our model does not distinguish times being negotiated from other times. It is possible that another structure is relevant for the intentional level: Rosé et al. (1995) do not specify whether or not this is so. The different findings may also be due to differences in the data: although their scheduling dialogs were collected under similar protocols, their protocol is like a radio conversation in which a button must be pressed in order to transmit, resulting in less dynamic interaction and longer turns (Villa 1994).</Paragraph> <Paragraph position="12"> An important discourse feature of the dialogs is the degree of redundancy of the times mentioned (Walker 1996). This limits the ambiguity of the times specified, and it also leads to a higher level of robustness, since additional DEs with the same time are placed on the focus list. These &quot;backup&quot; DEs might be available in case the rule applications fail on the most recent DE. Table 6 presents measures of redundancy. For illustration, the redundancy is broken down into the case where redundant plus additional information is provided (&quot;redundant&quot;) versus the case where the temporal information is just repeated (&quot;reiteration&quot;). This shows that roughly 25% of the CMU utterances with temporal information contain redundant temporal references, while 20% of the NMSU ones do.</Paragraph> </Section> </Paper>