<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2810">
  <Title>Finding Similar Sentences across Multiple Languages in Wikipedia</Title>
  <Section position="3" start_page="62" end_page="63" type="intro">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> The main focus of this paper lies with multilingual text similarity and its application to information access in the context of Wikipedia. Current research work related to Wikipedia mostly describes its monolingual properties (Ciffolilli, 2003; Viégas et al., 2004; Lih, 2004; Miller, 2005; Bellomi and Bonato, 2005; Voss, 2005; Fissaha Adafre and de Rijke, 2005). This is probably due to the fact that different language versions of Wikipedia have different growth rates. Others describe its application in question answering and other types of IR systems (Ahn et al., 2005). We believe that currently, Wikipedia pages for major European languages have reached a level where they can support multilingual research.</Paragraph>
    <Paragraph position="1"> On the other hand, there is a rich body of knowledge relating to multilingual text similarity. These include example-based machine translation, cross-lingual information retrieval, statistical machine translation, sentence alignment cost functions, and bilingual phrase translation (Kirk Evans, 2005).</Paragraph>
    <Paragraph position="2"> Each approach uses relatively different features (content and structural features) in identifying similar text from bilingual corpora. Furthermore, most methods assume that the bilingual corpora can be sentence aligned. This assumption does not hold for our case since our corpus is not parallel. In this paper, we use content-based features for identifying similar text across multilingual corpora. Particularly, we compare bilingual lexicon and MT system based methods for identifying similar text in Wikipedia.</Paragraph>
  </Section>
</Paper>