<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0905"> <Title>Evaluating Knowledge-based Approaches to the Multilingual Extension of a Temporal Expression Normalizer</Title> <Section position="3" start_page="0" end_page="31" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In recent years, inspired by the success of the MUC evaluations, a growing number of initiatives (e.g.</Paragraph> <Paragraph position="1"> TREC, CLEF, CoNLL, Senseval) have been launched to advance research on the automatic understanding of textual data. Since 1999, the Automatic Content Extraction (ACE) program has broadened the range of evaluation campaigns by proposing three main tasks, namely the recognition of entities, relations, and events. In 2004, the Timex2 Detection and Recognition task (also known as TERN, for Time Expression Recognition and Normalization) was added to the ACE program, making the whole evaluation exercise more complete. The main goal of the task was to foster research on systems capable of automatically detecting the temporal expressions (TEs) present in an English text, and normalizing them according to a specifically defined annotation standard.</Paragraph> <Paragraph position="2"> Within the above-mentioned evaluation exercises, research on monolingual tasks has gradually been complemented by considerable interest in the multilingual and cross-language capabilities of NLP systems. This trend confirms that portability across languages has become one of the key challenges for Natural Language Processing research, in the effort to break the language barrier that hampers the application of systems in many real-world scenarios. In this direction, machine learning techniques have become the standard approach in many NLP areas. 
This is motivated by several reasons, including i) the fact that considerable amounts of annotated data, indispensable for training ML-based algorithms, are now available for many tasks, and ii) the difficulty, inherent in rule-based approaches, of porting language models from one language to another. In fact, while supervised ML algorithms can easily be extended to new languages given an annotated training corpus, rule-based approaches require the set of rules to be redefined and adapted to each new language. This is time-consuming and costly, as it usually means manually rewriting large numbers of rules from scratch.</Paragraph> <Paragraph position="3"> In spite of their effectiveness for some tasks, ML techniques still fall short of providing effective solutions for others. This is confirmed by the outcomes of the TERN 2004 evaluation, which provide a clear picture of the situation. In spite of the good results obtained in the TE recognition task (Hacioglu et al., 2005), normalization by means of ML techniques has not yet been tackled, and remains an unresolved problem.</Paragraph> <Paragraph position="4"> Considering the inadequacy of ML techniques for the normalization problem, and focusing on portability across languages, this paper extends and completes the previous work presented in (Saquete et al., 2006b) and (Saquete et al., 2006a). More specifically, we address the following crucial issue: how to minimize the cost of building a rule-based TE recognition system for a new language, given an already existing system for another language. Our goal is to experiment with different automatic porting procedures to build temporal models for new languages, starting from previously defined ones. 
Still adhering to the rule-based paradigm, we analyse different porting methodologies that automatically learn the TE recognition model used by the system in one language, adjusting the set of normalization rules for the new target language.</Paragraph> <Paragraph position="5"> In order to provide a clear and comprehensive overview of the challenge, an incremental approach is proposed. Starting from the architecture of an existing system developed for Spanish (Saquete et al., 2005), we present a series of experiments which take advantage of different knowledge sources to build a homologous system for Italian. Building on one another, these experiments aim at incrementally analyzing the contribution of additional information to the TE normalization task. More specifically, the following information will be considered: The TERN task consists of automatically detecting, bracketing, and normalizing all the time expressions mentioned within an English text. The recognized TEs are then annotated according to the TIMEX2 annotation standard described in (Ferro et al., 2005). Markable TEs include both absolute (or explicit) expressions (e.g. &quot;April 15, 2006&quot;), and relative (or anaphoric) expressions (e.g. &quot;three years ago&quot;). Also markable are durations (e.g. &quot;two weeks&quot;), event-anchored expressions (e.g. &quot;two days before departure&quot;), and sets of times (e.g. &quot;every week&quot;). Detection and bracketing concern a system's ability to recognize TEs within an input text and to correctly determine their extent. Normalization concerns the system's ability to assign, for each detected TE, the correct values to the TIMEX2 normalization attributes. The meaning of these attributes can be summarized as follows: * VAL: contains the normalized value of a TE (e.g. 
&quot;2004-05-06&quot; for &quot;May 6th, 2004&quot;) * ANCHOR_VAL: contains a normalized form of an anchoring date-time.</Paragraph> <Paragraph position="6"> * ANCHOR_DIR: captures the relative direction-orientation between VAL and ANCHOR_VAL.</Paragraph> <Paragraph position="7"> * MOD: captures temporal modifiers (possible values include: &quot;approximately&quot;, &quot;more than&quot;, &quot;less than&quot;) * SET: identifies expressions denoting sets of times (e.g. &quot;every year&quot;).</Paragraph> <Section position="1" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 2.1 The evaluation benchmark </SectionTitle> <Paragraph position="0"> When moving to a new language, an evaluation benchmark is necessary to test system performance.</Paragraph> <Paragraph position="1"> For this purpose, the temporal annotations of the</Paragraph> </Section> <Section position="2" start_page="30" end_page="31" type="sub_section"> <SectionTitle> Italian Content Annotation Bank (I-CAB-temp7) </SectionTitle> <Paragraph position="0"> have been selected.</Paragraph> <Paragraph position="1"> I-CAB consists of 525 news documents taken from the Italian newspaper L'Adige (http://www.adige.it), and contains around 182,500 words. Its 3,830 temporal expressions (2,393 in the training part of the corpus, and 1,437 in the test part) have been manually annotated following the TIMEX2 standard, with some adaptations to the specific morpho-syntactic features of Italian, which has a far richer morphology than English (see (Magnini et al., 2006) for further details).</Paragraph> <Paragraph position="2"> 7 I-CAB is being developed as part of the three-year project ONTOTEXT funded by the Provincia Autonoma di Trento, Italy. See http://tcc.itc.it/projects/ontotext</Paragraph> </Section> </Section> </Paper>