<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0611"> <Title>Learning the Meaning and Usage of Time Phrases from a Parallel Text-Data Corpus</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 SumTime Project and Corpora </SectionTitle> <Paragraph position="0"> The SUMTIME project is investigating better technology for building software systems that automatically generate textual summaries of time-series data. One of the domains SUMTIME is working in is weather forecasts, and in this domain we acquired a corpus of 1119 weather forecasts (for off-shore oil rigs) written by five professional meteorologists (Sripada et al., 2002; Sripada et al., 2003b). The reports were primarily based on the output of a numerical weather simulation, and our corpus contains this information as well as the forecast texts. Each forecast is roughly 400 words long, giving a total corpus size of about 400,000 words. The forecasts are split into an initial section which gives an overview of the weather, and then additional sections which give detailed forecasts for different periods of time. Figure 1 shows an example extract from a forecast text; this is the detailed description of predicted weather on 25 Oct 2000, from a forecast issued at 3AM on 24 Oct 2000.</Paragraph> <Paragraph position="1"> Much of our analysis has focused on statements describing predicted wind speed and direction at 10 meters altitude during the first 72 hours after the forecast was issued. In other words, we focused on the WIND(10M) field from the detailed weather descriptions up to 3 days after the forecast was issued. One reason for focusing on wind statements is that they are based fairly directly on two fields from the data files, predicted wind direction and speed; the relationship between some of the other statements (such as weather) and the data files is more complex. 
The predicted wind (at 10m) speed and direction on 25 Oct 2000, from the 24 Oct 2000 data file, is shown in Table 1. This is the primary information that the meteorologists looked at when writing the wind statement in Figure 1, although they also have access to other information sources, such as satellite weather photographs.</Paragraph> <Paragraph position="2"> Each forecast contains 3 such wind statements, with an average length of approximately 10 words; hence there are about 30,000 words in our wind-statement subcorpus.</Paragraph> <Paragraph position="3"> This of course is very small compared to many text-only corpora such as the British National Corpus (BNC), but we believe that our weather forecast corpus is one of the largest parallel text-data corpora in existence.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Analysis Procedure for Time Phrases </SectionTitle> <Paragraph position="0"> One of SUMTIME's research goals is to learn the meaning of time phrases; in other words, what a forecaster meant when he used a time phrase such as by evening or after midnight. We also wished to learn which time phrase should be included in a computer-generated weather forecast text to indicate a time; for example, which time phrase should be used to indicate a change in the weather at 1200. Note that it is rare for weather forecasts to explicitly mention numerical times such as 1200, and also that although there are standard terminologies for some meteorological phenomena such as cloud cover and precipitation, we are not aware of any standard terminologies for the use of time phrases in weather forecasts.</Paragraph> <Paragraph position="1"> We performed this analysis as follows. First we extracted the wind at 10 meters statements for the next 72 hours from all forecasts in our corpus, and parsed these texts with a simple parser tuned to the linguistic structure of these texts. 
The parser essentially broke sentences up into individual phrases, and then recorded the speed, direction, and time phrase mentioned in each such phrase, along with other information (such as the verb) which was not used in the analysis described here. For example, the WIND(10M) statement from Figure 1 was broken up by the parser into four wind phrases: 1. SSW 12-16 (speed:12-16, direction:SSW, timephrase: none) 2. BACKING ESE 16-20 IN THE MORNING, (speed:16-20, direction:ESE, timephrase: IN THE MORNING) 3. BACKING NE EARLY AFTERNOON (speed:(16-20), direction:NE, timephrase: EARLY AFTERNOON) 4. THEN NNW 24-28 LATE EVENING (speed:24-28, direction:NNW, timephrase: LATE EVENING) If a wind phrase did not specify speed or direction, the parser assumed that this was unchanged from the previous wind phrase; such elision is common in weather forecast texts. Thus, for example, the speed recorded for</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> BACKING NE EARLY AFTERNOON is 16-20, which </SectionTitle> <Paragraph position="0"> is the speed from the previous phrase (BACKING ESE 16-20 IN THE MORNING). Our parser successfully parsed 3225 of the 3357 WIND(10M) statements; 132 (4%) of the statements could not be parsed. The parser produced 8198 wind phrases in total.</Paragraph> <Paragraph position="1"> From these 8198 wind phrases we selected those phrases which (a) included a time phrase, (b) did not use a qualifier such as mainly or occasionally, (c) did not specify that wind speed or direction was variable, (d) for which we had the corresponding data files, and (e) for which we knew the forecast author. There were 3654 such phrases. The majority (4014 phrases) of the eliminated phrases did not specify a time phrase, such as the first phrase (SSW 12-16) in the above example.</Paragraph> <Paragraph position="2"> We next associated each wind phrase with an entry in the corresponding data file. 
In other words, we aligned the textual wind phrases with the numeric data file entries. As in other uses of parallel corpora, good alignment is essential in order for the results to be meaningful (Och and Ney, 2000).</Paragraph> <Paragraph position="3"> To associate data file entries with wind phrases, we first searched the data file for entries which matched the wind phrase. An entry matched if its speed was within the range defined in the phrase, and if its direction was within 12 degrees of the direction mentioned in the phrase. In 343 cases, no data file entry matched the wind phrase. We believe that such cases were mostly due to (a) forecasters not literally reporting the data file, but instead adjusting what they said based on their meteorological expertise and on information not available to the numerical weather simulation (such as satellite weather images); (b) forecasters reporting a simultaneous change in wind speed and direction, when in fact speed and direction changed at different times (this may be due to forecasters trying to write texts quickly, so that they can use the most up-to-date data (Reiter et al., 2003)); and (c) forecaster errors. For example, the third phrase in our example, BACKING NE EARLY AFTERNOON, does not match any of the data file entries shown in Table 1. This could be because the forecaster decided that the numerical forecast was underestimating the speed at which the wind was shifting, and hence he believed that the wind would be NE at 12 or 15, even though the data file predicted E and ENE for these times. It could also be that the forecaster made a mistake, and perhaps was intending to write ENE but wrote NE instead because he was writing under time pressure. In any case, wind phrases which did not match any data file entries were dropped from our analysis.</Paragraph> <Paragraph position="4"> Out of the 3311 matched wind phrases, 1434 (43%) were unambiguous and only matched one data file entry. 
For example, the fourth wind phrase in our example, THEN NNW 24-28 LATE EVENING, matches only one data file entry, the one for 0000 on 26 Oct 2000.</Paragraph> <Paragraph position="5"> 1877 (57%) of the matched wind phrases were ambiguous and matched more than one data file entry. Typically this happened when the wind was changing slowly and hence two or more adjacent data file entries matched the wind phrase. In such cases we checked if one data file entry had a speed that was closer than those of the other data file entries to the middle of the speed range in the textual wind phrase. This heuristic produced a preferred match for 1105 (33%) of the matched wind phrases, and left 772 (23%) phrases as ambiguous and unmatched.</Paragraph> <Paragraph position="6"> For example, the second wind phrase in our example, BACKING ESE 16-20 IN THE MORNING, matches two data file entries: 0600 (direction ESE, speed 18) and 0900 (direction ESE, speed 16). The midpoint of the 16-20 speed range reported in the forecast is 18, so our speed heuristic matches this wind phrase to the 0600 data file entry, since its speed is closer to the speed range midpoint (indeed, it matches the midpoint).</Paragraph> <Paragraph position="7"> We evaluated our alignment process by applying it to the subset of wind phrases which used a time phrase which we thought had a clear and unambiguous interpretation, namely an absolute time (such as by 0600), by midday (which means 1200), by midnight (which means 0000), and by end of period (which means either 0000 or 0600, depending on the forecast section). The matching process was fairly accurate; in 86% of cases it produced the expected meaning (such as 0000 for by midnight).</Paragraph> <Paragraph position="8"> Clearly we would benefit from better matching and alignment techniques, and we wonder if perhaps some of the alignment techniques used for parallel multi-lingual corpora (Och and Ney, 2000) could be adapted to help align our text-data corpora. 
This is a topic we plan to investigate in future research.</Paragraph> <Paragraph position="9"> This matching/alignment procedure is different in detail from the preliminary analysis procedure reported in (Reiter and Sripada, 2002b). The procedure used in our earlier paper aligned fewer phrases (1382 vs. 2539) and had a higher error rate (22% vs. 14%), so it was inferior.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> We examined the association between time phrase and time in the 2539 aligned (wind phrase, data file entry) pairs. In this analysis, we regarded time phrases as different if they involved different head nouns (for example, by evening and by afternoon), different prepositions (for example, midday and by midday) and/or different adjectives (for example, by afternoon and by late afternoon). However, we ignored determiners (for example, by this evening was regarded as the same phrase as by evening). Tables 2, 3, and 4 give details of the usage of the three most common non-contextual time phrases: by evening, by midday, and by late evening. These tables also show the statistical significance of differences between forecasters, calculated with a chi-square test (which treats time as a categorical variable). As some colleagues have expressed an interest in a one-way ANOVA analysis (which compares mean time), we show this as well where it gives a substantially different value from the chi-square analysis. This data suggests that: • by evening means different things to different people; for example, F1 and F4 primarily use this phrase to mean 1800, while F3 primarily uses this phrase to mean 0000.</Paragraph> <Paragraph position="1"> • by midday was used in a very similar way by all forecasters (ignoring F2, who only used the term once). • by late evening was used by all forecasters (who used this term) primarily to mean 0000. 
However, the usages of the different forecasters were still significantly different. This reflects a difference in the distribution of usage; in particular, F3 almost always (98% of cases) used this phrase to mean 0000, while F4 and F5 used this phrase to mean 0000 in about 80% of cases.</Paragraph> <Paragraph position="2"> These patterns are replicated across the corpus: some phrases (such as by midday and by morning) are used in the same way by all forecasters; some phrases (such as by evening and by late morning) are used in very different ways by the forecasters; and some phrases (such as by late evening and by midnight) have the same core meaning (e.g., 0000) but different distributions around the core. We have, incidentally, looked for seasonal variations in meaning (for example, by evening meaning one thing in the winter and another in the summer), but we have found no evidence of such variation.</Paragraph> <Paragraph position="3"> Roy (2002) has also noted variation in the meanings that individuals assign to words, in his parallel text-data study of object descriptions. For example, one object might be described as having the colour pink by one subject, but other subjects might have problems identifying the object when it was described as pink, because they did not consider it to have this colour.</Paragraph> <Paragraph position="4"> Table 5 presents the most common time phrase used by each forecaster for each time, including context-dependent phrases such as later. This highlights major 'stylistic' differences between forecasters in terms of which time phrases they prefer to use. For example, F1 and F2 make heavy use of contextual time phrases such as later and soon, while F5 (and to a lesser extent F4) seem to prefer to avoid such terms. It is also interesting that contextual time phrases are especially commonly used to refer to the time 0300. 
We wonder if this could reflect a 'lexical gap' in English; there are no commonly used time phrases in English for times around 0300, and perhaps this encourages the forecasters to use contextual time phrases to refer to 0300.</Paragraph> <Paragraph position="5"> Table 6 presents the most common (mode) meaning of non-contextual time phrases, for each forecaster. Perhaps not surprisingly, the greatest variability occurs when a time phrase denoting a time period (morning, afternoon, or evening) occurs without being modified by an adjective (early, mid, or late). The data also suggests that the forecasters may disagree about the meaning of morning, with F4 in particular considering morning to be the period 0300-0900, while F5 considers morning to be the period 0600-1200.</Paragraph> </Section></Paper>