File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/a00-1020_intro.xml
Size: 3,800 bytes
Last Modified: 2025-10-06 14:00:41
<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1020"> <Title>Multilingual Coreference Resolution</Title> <Section position="3" start_page="0" end_page="142" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The recent availability of large bilingual corpora has spawned interest in several areas of multilingual text processing. Most of the research has focused on bilingual terminology identification, either as parallel multiword forms (e.g. the Champollion system (Smadja et al., 1996)), technical terminology (e.g.</Paragraph> <Paragraph position="1"> the Termight system (Dagan and Church, 1994)) or broad-coverage translation lexicons (e.g. the SABLE system (Resnik and Melamed, 1997)). In addition, the Multilingual Entity Task (MET) from the TIPSTER program (http://www-nlpir.nist.gov/relatedprojeets/tipster/met.htm), a government effort to advance the state of the art in text processing technologies, challenged the participants in the Message Understanding Conference (MUC) to extract named entities across several foreign language corpora, such as Chinese, Japanese and Spanish.</Paragraph> <Paragraph position="2"> In this paper we present a new application of aligned multilingual texts. Since coreference is a pervasive discourse phenomenon that impedes the performance of current information extraction (IE) systems, we considered a corpus of aligned English and Romanian texts to identify coreferring expressions. Our task focused on the same kind of coreference as considered in the past MUC competitions, namely</Paragraph> <Paragraph position="3"> identity coreference. Identity coreference links nouns, pronouns and noun phrases (including proper names) to their corresponding antecedents.</Paragraph> <Paragraph position="4"> We created our bilingual collection by having native speakers translate the MUC-6 and MUC-7 coreference training texts into Romanian.
The training data set for Romanian coreference used, wherever possible, the same coreference identifiers as the English data and incorporated additional tags as needed. Our claim is that by adding the wealth of coreferential features provided by multilingual data, powerful new heuristics for coreference resolution can be developed that outperform monolingual coreference resolution systems.</Paragraph> <Paragraph position="5"> For both languages, we resolved coreference by using SWIZZLE, our implementation of a bilingual coreference resolver. SWIZZLE is a multilingual enhancement of COCKTAIL (Harabagiu and Maiorano, 1999), a coreference resolution system that operates on a mixture of heuristics that combine semantic and textual cohesive information. (The name COCKTAIL is a pun on CogNIAC: COCKTAIL combines a larger number of heuristics than those reported in (Baldwin, 1997), and SWIZZLE adds further heuristics discovered from the bilingual aligned corpus.) When COCKTAIL was applied separately to the English and the Romanian texts, coreferring links were identified for each English and Romanian document respectively.</Paragraph> <Paragraph position="6"> When aligned referential expressions corefer with non-aligned anaphors, SWIZZLE derived new heuristics for coreference. Our experiments show that SWIZZLE outperformed COCKTAIL on both English and Romanian test documents.</Paragraph> <Paragraph position="7"> The rest of the paper is organized as follows. Section 2 presents COCKTAIL, a monolingual coreference resolution system used separately on both the English and Romanian texts. Section 3 details the data-driven approach used in SWIZZLE and presents some of its resources. Section 4 reports and discusses the experimental results. Section 5 summarizes the</Paragraph> <Paragraph position="8"> conclusions.</Paragraph> </Section> </Paper>