<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1312">
  <Title>A Multilingual Approach to Annotating and Extracting Temporal Information</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 Interlingual Representation
2.1 Introduction
</SectionTitle>
    <Paragraph position="0"> Although the guidelines were developed with detailed examples drawn from English (along with English-specific tokenization rules and guidelines for determining tag extent), the semantic representation we use is intended for use across languages. This will permit the development of temporal taggers for different languages trained using a common annotation scheme.</Paragraph>
    <Paragraph position="1"> It will also allow for new methods for evaluating machine translation of temporal expressions at the level of interpretation as well as at the surface level. As discussed in (Hirschman et al. 2000), time expressions generally fall into the class of so-called named entities, which includes proper names and various kinds of numerical expressions. The translation of named entities is less variable stylistically than the translation of general text, and once predictable variations due to differences in transliteration, etc. are accounted for, the alignment of the machine-translated expressions with a reference translation produced by a human can readily be accomplished. A variant of the word-error metric used to evaluate the output of automatic speech transcription can then be applied to produce an accuracy score. In the case of our current work on temporal expressions, it will also be possible to use the normalized time values to participate in the alignment and scoring.</Paragraph>
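    <Paragraph> As an illustration of the word-error scoring described above, the following is a minimal sketch, assuming the machine-translated and reference time expressions have already been reduced to sequences of tokens or normalized values. The function names edit_distance and word_error_rate are hypothetical and are not taken from the system described here.
<![CDATA[
```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference sequence."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```
]]>
    Because the sequences may hold normalized VAL strings rather than surface words, the same routine could in principle score interpretation-level agreement as well as surface agreement. </Paragraph>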
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Semantic Distinctions
</SectionTitle>
      <Paragraph position="0"> Three different kinds of time values are represented: points in time (answering the question &quot;when?&quot;), durations (answering &quot;how long?&quot;), and frequencies (answering &quot;how often?&quot;).</Paragraph>
      <Paragraph position="1"> * Points in time are calendar dates and times of day, or a combination of both, e.g., Monday 3 pm, Monday next week, a Friday, early Tuesday morning, the weekend. These are all represented with values (the tag attribute VAL) in the ISO format, which allows for representation of date of the month, month of the year, day of the week, week of the year, and time of day, e.g.,  rather than particular points. SET and GRANULARITY attributes are used for such expressions, with the PERIODICITY attribute being used for regularly recurring times, e.g., &lt;TIMEX2 VAL=&quot;XXXX-WXX-</Paragraph>
      <Paragraph position="3"> Here &quot;F1W&quot; means a frequency of once a week, and the granularity &quot;G1D&quot; means the set members are counted in day-sized units.</Paragraph>
      <Paragraph position="4"> The annotation scheme also addresses several semantic problems characteristic of temporal expressions: * Fuzzy boundaries. Expressions like Saturday morning and Fall are fuzzy in their intended value with respect to when the time period starts and ends; the early 60's is fuzzy as to which part of the 1960's is included. Our format for representing time values includes parameters such as FA (for Fall), EARLY (for early, etc.), PRESENT_REF (for today, current, etc.), among others. For example, we have  than a decade ago&lt;/TIMEX2&gt;. The intent here is that a given application may choose to assign specific values to these parameters if desired; the guidelines themselves don't dictate the specific values.</Paragraph>
      <Paragraph position="5"> * Non-Specificity. Our scheme directs the annotator to represent the values, where possible, of temporal expressions that do not indicate a specific time. These non-specific expressions include generics, which state a generalization or regularity of some kind, e.g., &lt;TIMEX2 VAL=&quot;XXXX-04&quot; NON_SPECIFIC=&quot;YES&quot;&gt;April&lt;/TIMEX2&gt; is usually wet, and non-specific indefinites,</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Reference Corpus
</SectionTitle>
    <Paragraph position="0"> Based on the guidelines, we have arranged for 6 subjects to annotate an English reference corpus, consisting of 32,000 words of a telephone dialog corpus - English translations of the 'Enthusiast' corpus of Spanish meeting scheduling dialogs used at CMU and by (Wiebe et al. 1998), 35,000 words of New York Times newspaper text and 120,000 words of broadcast news (TDT2 1999).</Paragraph>
    <Paragraph position="1"> This corpus will soon be made available to the research community.</Paragraph>
  </Section>
  <Section position="6" start_page="2" end_page="4" type="metho">
    <SectionTitle>
4 Time Tagger System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Architecture
</SectionTitle>
      <Paragraph position="0"> The tagging program takes in a document which has been tokenized into words and sentences and tagged for part-of-speech. The program passes each sentence first to a module that flags time expressions, and then to another module (SC) that resolves self-contained (i.e., 'absolute') time expressions. Absolute expressions are typically processed through a lookup table that translates them into a point or period that can be described by the ISO standard.</Paragraph>
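      <Paragraph> The lookup step for absolute expressions can be sketched as follows. This is only an illustration under stated assumptions: the month table, the two regular expressions, and the function name resolve_absolute are hypothetical, and the actual SC module presumably covers many more patterns.
<![CDATA[
```python
import re

# Hypothetical lookup table: lower-cased English month name -> month number.
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# "January 31st, 1999" style: month, day (optional ordinal suffix), year.
_FULL_DATE = re.compile(
    r"^([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})$")
# "June 1999" style: month and year only.
_MONTH_YEAR = re.compile(r"^([A-Za-z]+)\s+(\d{4})$")


def resolve_absolute(text):
    """Return an ISO-style value for a self-contained expression, or None."""
    text = text.strip()
    match = _FULL_DATE.match(text)
    if match and match.group(1).lower() in MONTHS:
        month = MONTHS[match.group(1).lower()]
        return "%s-%02d-%02d" % (match.group(3), month, int(match.group(2)))
    match = _MONTH_YEAR.match(text)
    if match and match.group(1).lower() in MONTHS:
        return "%s-%02d" % (match.group(2), MONTHS[match.group(1).lower()])
    return None  # not self-contained; left for later processing
```
]]>
      Under these assumptions, resolve_absolute applied to the expression June 1999 yields 1999-06, and an expression such as yesterday yields None, leaving it to the discourse processing stage. </Paragraph>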
      <Paragraph position="1"> The program then takes the entire document and passes it to a discourse processing module (DP) which resolves context-dependent (i.e., 'relative') time expressions (indexicals as well as other expressions). The DP module tracks transitions in temporal focus, using syntactic clues and various other knowledge sources.</Paragraph>
      <Paragraph position="2"> The module uses a notion of Reference Time to help resolve context-dependent expressions. Here, the Reference Time is the time a context-dependent expression is relative to. The reference time (italicized here) must either be described (as in &quot;a week from Wednesday&quot;) or implied (as in &quot;three days ago [from today]&quot;). In our work, the reference time is assigned the value of either the Temporal Focus or the document (creation) date. The Temporal Focus is the time currently being talked about in the narrative. The initial reference time is the document date.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Assignment of Time Values
</SectionTitle>
      <Paragraph position="0"> We now discuss the assignment of values to identified time expressions. Times which are fully specified are tagged with their value by the SC module, e.g., &quot;June 1999&quot; as 1999-06. The DP module uses an ordered sequence of rules to handle the context-dependent expressions. These cover the following cases: * Explicit offsets from reference time: indexicals like &quot;yesterday&quot;, &quot;today&quot;, &quot;tomorrow&quot;, &quot;this afternoon&quot;, etc., are ambiguous between a specific and a non-specific reading. The specific use (distinguished from the generic one by machine learned rules discussed in (Mani and Wilson 2000)) gets assigned a value based on an offset from the reference time, but the generic use does not. For example, if &quot;fall&quot; is immediately preceded by &quot;last&quot; or &quot;next&quot;, then &quot;fall&quot; is seasonal (97.3% accurate rule). If &quot;fall&quot; is followed two words later by a year expression, then &quot;fall&quot; is seasonal (86.3% accurate).</Paragraph>
      <Paragraph position="1"> * Positional offsets from reference time: Expressions like &quot;next month&quot;, &quot;last year&quot; and &quot;this coming Thursday&quot; use lexical markers (underlined) to describe the direction and magnitude of the offset from the reference time.</Paragraph>
      <Paragraph position="2"> * Implicit offsets based on verb tense: Expressions like &quot;Thursday&quot; in &quot;the action taken Thursday&quot;, or bare month names like &quot;February&quot;, are passed to rules that try to determine the direction and the magnitude of the offset from the reference time. The tense of a neighboring verb is used to decide in which direction to look to resolve the expression.</Paragraph>
      <Paragraph position="3"> * Further use of lexical markers: Other expressions lacking a value are examined for the nearby presence of a few additional markers, such as &quot;since&quot; and &quot;until&quot;, that suggest the direction of the offset.</Paragraph>
      <Paragraph position="4"> * Nearby Dates: If a direction from the reference time has not been determined, some dates, like &quot;Feb. 14&quot;, and other expressions that indicate a particular date, like &quot;Valentine's Day&quot;, may still be untagged because the year has not been determined. If the year can be chosen in a way that makes the date in question less than a month from the reference date, that year is chosen. Dates more than a month away are not assigned values by this rule.</Paragraph>
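      <Paragraph> Two of the cases above, explicit offsets for indexicals and the Nearby Dates rule, can be sketched as follows. This is a minimal illustration, not the DP module itself: the offset table, the function names, and the approximation of &quot;less than a month&quot; as a 30-day window are all assumptions made for the example.
<![CDATA[
```python
from datetime import date, timedelta

# Hypothetical table: indexical -> offset in days from the reference time.
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_indexical(expr, reference):
    """Resolve a specific-reading indexical against the reference date."""
    offset = DAY_OFFSETS.get(expr.lower())
    if offset is None:
        return None
    return (reference + timedelta(days=offset)).isoformat()


def resolve_nearby_date(month, day, reference, window_days=30):
    """Choose the year that places month/day within the window, else None.

    window_days=30 is an assumed stand-in for "less than a month".
    """
    best = None
    for year in (reference.year - 1, reference.year, reference.year + 1):
        try:
            candidate = date(year, month, day)
        except ValueError:  # e.g. Feb. 29 in a non-leap year
            continue
        gap = abs((candidate - reference).days)
        if gap <= window_days and (best is None or gap < best[0]):
            best = (gap, candidate)
    return best[1].isoformat() if best else None
```
]]>
      With a reference date of 1999-01-30, the sketch assigns Feb. 14 the value 1999-02-14, while a date such as July 4 stays unresolved because no choice of year brings it inside the window. </Paragraph>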
    </Section>
    <Section position="3" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
4.3 Time Tagging Performance
</SectionTitle>
      <Paragraph position="0"> The system performance on a test set of 221 articles from the print and broadcast news section of the reference corpus (the test set had a total of 78,171 words) is shown in Table 1.</Paragraph>
      <Paragraph position="1"> Note that if the human said the tag had no value, and the system decided it had a value, this is treated as an error. A baseline of just tagging values of absolute, fully specified expressions (e.g., &quot;January 31st, 1999&quot;) is shown for comparison in parentheses.</Paragraph>
      <Paragraph position="2">  The development of a tagging program for other languages closely parallels the process for English and reuses some of the code. Each language has its own set of lexical trigger words that signal a temporal expression. Many of these, e.g. day, week, etc., are simply translations of English words.</Paragraph>
      <Paragraph position="3"> Often, there will be some additional triggers with no corresponding word in English. For example, some languages contain a single lexical item that would translate in English as &quot;the day after tomorrow&quot;. For each language, the triggers and lexical markers must be identified.</Paragraph>
      <Paragraph position="4"> As in the case of English, the SC module for a new language handles the case of absolute expressions, with the DP module handling the relative ones. It appears that in most languages, in the absence of other context, relative expressions with an implied reference time are relative to the present. Thus, tools built for one language that compute offsets from a base reference time will carry over to other languages. (Footnote: The evaluated version of the system does not adjust the Reference Time for subsequent sentences.)</Paragraph>
      <Paragraph position="5"> As an example, we will briefly describe the changes that were needed to develop a Spanish module, given our English one. Most of the work involved pairing the Spanish surface forms with the already existing computations, e.g., we already computed &quot;yesterday&quot; as meaning &quot;one day back from the reference point&quot;. This had to be attached to the new surface form &quot;ayer&quot;. Because not all computers generate the required character encodings, we allowed expressions both with and without diacritical marks, e.g., mañana and manana. Besides the surface forms, there are a few differences in conventions that had to be accounted for. Times are mostly stated using a 24-hour clock. Dates are usually written in the European form day/month/year rather than the US-English convention of month/day/year.</Paragraph>
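      <Paragraph> The pairing of new surface forms with existing computations can be sketched as follows. This is a minimal illustration under stated assumptions: the concept labels, the two tables, and the functions strip_diacritics and resolve are hypothetical, and the sketch ignores the &quot;morning&quot; reading of mañana.
<![CDATA[
```python
import unicodedata
from datetime import date, timedelta


def strip_diacritics(text):
    """Remove combining marks so accented and unaccented forms compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Shared, language-independent computations: day offsets from the reference time.
OFFSETS = {"DAY_BACK": -1, "SAME_DAY": 0, "DAY_FORWARD": 1}

# Hypothetical per-language surface forms, stored without diacritics.
SURFACE_FORMS = {
    "en": {"yesterday": "DAY_BACK", "today": "SAME_DAY", "tomorrow": "DAY_FORWARD"},
    "es": {"ayer": "DAY_BACK", "hoy": "SAME_DAY", "manana": "DAY_FORWARD"},
}


def resolve(expr, lang, reference):
    """Map a surface form to its shared computation and apply it."""
    key = strip_diacritics(expr.strip().lower())
    concept = SURFACE_FORMS[lang].get(key)
    if concept is None:
        return None
    return (reference + timedelta(days=OFFSETS[concept])).isoformat()
```
]]>
      Only the SURFACE_FORMS table is language-specific in this sketch; the offset computation is reused unchanged, which mirrors the reuse described above. </Paragraph>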
      <Paragraph position="6"> A difficulty arises because of the use of multiple calendric systems. While the Gregorian calendar is widely used for business across the world, holidays and other social events are often represented in terms of other calendars. For example, the month of Ramadan is a regularly recurring event in the Islamic calendar, but shifts around in the Gregorian calendar.</Paragraph>
      <Paragraph position="7"> Here are some examples of tagging of parallel text from Spanish and English with a common representation.</Paragraph>
      <Paragraph position="8">  Our annotation guidelines state that a holiday name is markable but should receive a value only when that value can be inferred from the context of the text, rather than from cultural and world knowledge.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="4" end_page="4" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> Our scheme differs from the recent scheme of (Setzer and Gaizauskas 2000) in terms of our in-depth focus on representations for the values of specific classes of time expressions, and in the application of our scheme to a variety of different genres, including print news, broadcast news, and meeting scheduling dialogs. Others have used temporal annotation schemes for the much more constrained domain of meeting scheduling, e.g., (Wiebe et al. 1998), (Alexandersson et al. 1997), (Busemann et al. 1997). Our scheme has been applied to such domains as well, our annotation of the Enthusiast corpus being an example.</Paragraph>
  </Section>
</Paper>