<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2001"> <Title>Multilingual Extension of a Temporal Expression Normalizer using Annotated Corpora</Title> <Section position="3" start_page="1" end_page="2" type="metho"> <SectionTitle> 2 The TERSEO system architecture </SectionTitle> <Paragraph position="0"> TERSEO was developed to automatically recognize temporal expressions (TEs) in text written in Spanish and to normalize them according to the temporal model proposed in (Saquete, 2005), which is compatible with the ACE annotation standards for temporal expressions (Ferro et al., 2005). As shown in Figure 1, the first step (recognition) includes pre-processing of the input texts, which are tagged with lexical and morphological information that is used as input to the temporal parser. The temporal parser is implemented using a bottom-up technique (chart parser) and is based on a temporal grammar. Once the parser has recognized the TEs in an input text, they are passed to the normalization unit, which updates the value of the reference according to the date they refer to and generates the XML tags for each expression.</Paragraph> <Paragraph position="1"> As TEs can be categorized as explicit or implicit, the grammar used by the parser is tuned to discriminate between the two groups.</Paragraph> <Paragraph position="2"> On the one hand, explicit temporal expressions directly and fully describe a date and require no further reasoning to be interpreted (e.g. &quot;1st May 2005&quot;, &quot;05/01/2005&quot;). On the other hand, implicit (or anaphoric) temporal expressions (e.g. &quot;yesterday&quot;, &quot;three years later&quot;) require some degree of reasoning (as in the case of anaphora resolution). 
To translate such expressions into explicit dates, the reasoning process considers the information provided by the lexical context in which they occur (see (Saquete, 2005) for a thorough description of the reasoning techniques used by TERSEO).</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.1 Recognition using a temporal expression parser </SectionTitle> <Paragraph position="0"> The parser uses a grammar based on two different sets of rules. The first set of rules handles date and time recognition (i.e. explicit dates, such as &quot;05/01/2005&quot;). For this type of TE, the grammar adopted by TERSEO recognizes a large number of date and time formats (see Table 1 for some examples).</Paragraph> <Paragraph position="1"> The second set of rules handles the recognition of temporal references for implicit TEs, i.e. TEs that need to be related to an explicit TE to be interpreted. These can be divided into time adverbs (e.g. &quot;yesterday&quot;, &quot;tomorrow&quot;) and nominal phrases that refer to temporal relationships (e.g. &quot;three years later&quot;, &quot;the day before&quot;). Table 2 shows some of the rules used to detect these kinds of references.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.2 Normalization </SectionTitle> <Paragraph position="0"> When the system finds an explicit temporal expression, normalization is direct, as no resolution of the expression is necessary. For implicit expressions, an inference engine is used that interprets every reference previously found in the input text. In some cases references are resolved using the newspaper's date (FechaP). 
Other TEs have to be interpreted by referring to a date mentioned earlier in the text being analyzed (FechaA).</Paragraph> <Paragraph position="1"> For these cases, a temporal model has been defined that allows the system to determine the reference date on which the dictionary operations are performed. This model is based on the following two rules: 1. The newspaper's date, when available, is used as the base temporal referent by default; otherwise, the current date is used as the anchor.</Paragraph> <Paragraph position="2"> 2. When a non-anaphoric TE is found, it is stored as FechaA. This value is updated every time a non-anaphoric TE appears in the text.</Paragraph> <Paragraph position="3"> Table 3 shows some of the entries of the dictionary used in the inference engine.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 Extending TERSEO: from Spanish and English to Italian </SectionTitle> <Paragraph position="0"> As stated before, the main purpose of this paper is to describe a new procedure to automatically build temporal models for new languages, starting from previously defined models. In our case, an English model was automatically obtained from the Spanish one through the automatic translation of the Spanish temporal expressions into English. The resulting system for the recognition and normalization of English TEs obtains good results both in terms of precision (P) and recall (R) (Saquete et al., 2004). The comparison of the results between the Spanish and the English system is shown in and English TERSEO.</Paragraph> <Paragraph position="1"> This section presents the procedure we followed to extend our system to Italian, starting from the Spanish and English models already available and a manually annotated corpus. Both models are considered, since they complement each other. 
The Spanish model was
The Spanish model was</Paragraph> </Section> <Section position="5" start_page="3" end_page="4" type="metho"> <SectionTitle> REFERENCE DICTIONARY ENTRY </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> scores for precision (88%), so better results could be expected when it is used. However, in spite of the fact that the English model has shown lower results on precision (77%), the on-line translators between Italian and English have better results than Spanish to Italian translators. As a result, both models are considered in the following steps for the multilingual extension: Firstly, a set of Italian temporal expressions is extracted from an Italian annotated corpus and stored in a database. The selected corpus is the training part of I-CAB, the Italian Content Annotation Bank (Lavelli et al., 2005). More detailed information about I-CAB is provided in Section 4.</Paragraph> <Paragraph position="3"> Secondly, the resulting set of Italian TEs must be related to the appropriate normalization rule. In order to do that, a double translation procedure has been developed. We first translate all the expressions into English and Spanish simultaneously; then, the normalization rules related to the translated expressions are obtained. If both the Spanish and English expressions are found in their respective models in agreement with the same normalization rule, then this rule is also assigned to the Italian expression. Also, when only one of the translated expressions is found in the existing models, the normalization rule is assigned. In case of discrepancies, i.e. if both expressions are found, but not coinciding in the same normalization rule, then one of the languages must be prioritized. As the Spanish model was manually obtained and has shown a higher precision, Spanish rules are preferred. 
In the remaining case, where neither translation is found, the expression is reserved for manual assignment.</Paragraph> <Paragraph position="4"> Finally, the set is automatically augmented using the Spanish and English sets of temporal expressions. These expressions were also translated into Italian by on-line machine translation systems (Spanish-Italian, English-Italian). In this case, a filtering module is used to check that the expressions were correctly translated. This module searches the web with Google for the translated expression. If the expression is not frequently found, the translation is discarded. Otherwise, the new Italian expression is included in the model and related to the same normalization rule assigned to the Spanish or English temporal expression.</Paragraph> <Paragraph position="5"> The entire translation process is complemented by an automatic generalization process, aimed at obtaining generalized rules from the concrete cases collected from the corpus. This generalization process has a double effect. On the one hand, it reduces the number of recognition rules. On the other hand, it allows the system to identify new expressions that were not previously learned. For instance, the expression &quot;Dieci mesi dopo&quot; (i.e. &quot;Ten months later&quot;) could be recognized if the expression &quot;Nove mesi dopo&quot; (i.e. &quot;Nine months later&quot;) was learned.</Paragraph> <Paragraph position="6"> The multilingual extension procedure (Figure 3) is carried out in three phases. In the first phase, the Italian temporal expressions are collected from I-CAB (Italian Content Annotation Bank), and the automatically translated Italian TEs are derived from the set of Spanish and English TEs. The translated TEs are filtered, removing those not found by Google.</Paragraph> <Paragraph position="7"> Phase 2: TE Generalization. 
In this phase, the TEs Gramatics Generator uses the morphological and syntactic information from the collected TEs to generate the grammatical rules that generalize the recognition of the TEs. Moreover, the keyword unit extracts the temporal keywords that will be used to build new TEs. These keywords are augmented with their synonyms in WordNet (Vossen, 2000) to generate new TEs.</Paragraph> <Paragraph position="8"> Phase 3: TE Normalizing Rule Assignment.</Paragraph> <Paragraph position="9"> In the last phase, the translators are used to relate each recognition rule to the appropriate normalization rule. For this purpose, the system takes advantage of the previously defined Spanish and English temporal models.</Paragraph> </Section> </Paper>