<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2810">
<Title>Finding Similar Sentences across Multiple Languages in Wikipedia</Title>
<Section position="4" start_page="63" end_page="63" type="metho">
<SectionTitle> 3 Wikipedia as a Multilingual Corpus </SectionTitle>
<Paragraph position="0"> Wikipedia is a free online encyclopedia which is administered by the non-profit Wikimedia Foundation. The aim of the project is to develop free encyclopedias for different languages. It is a collaborative effort of a community of volunteers, and its content can be edited by anyone. It is attracting increasing attention amongst web users and has joined the top 50 most popular sites.</Paragraph>
<Paragraph position="1"> As of January 1, 2006, there are versions of Wikipedia in more than 200 languages, with sizes ranging from 1 to over 800,000 articles. We used the ASCII text version of the English and Dutch Wikipedia, which are available as database dumps.</Paragraph>
<Paragraph position="2"> Each entry of the encyclopedia (a page in the online version) corresponds to a single line in the text file. Each line consists of an ID (usually the name of the entity) followed by its description. The description part contains the body of the text that describes the entity. It contains a mixture of plain text and text with HTML tags. References to other Wikipedia pages in the text are marked using &quot;[[&quot; &quot;]]&quot; which corresponds to a hyperlink on the online version of Wikipedia. Most of the formatting information which is not relevant for the current task has been removed.</Paragraph>
<Section position="1" start_page="63" end_page="63" type="sub_section">
<SectionTitle> 3.1 Links within a single language </SectionTitle>
<Paragraph position="0"> Wikipedia is a hypertext document with a rich link structure. A description of an entity usually contains hypertext links to other pages within or outside Wikipedia. The majority of these links correspond to entities, which are related to the entity being described, and have a separate entry in Wikipedia.
These links are used to guide the reader to a more detailed description of the concept denoted by the anchor text. In other words, the links in Wikipedia typically indicate a topical association between the pages, or rather the entities being described by the pages. E.g., in describing a particular person, reference will be made to such entities as country, organization and other important entities which are related to it and which themselves have entries in Wikipedia. In general, due to the peculiar characteristics of an encyclopedia corpus, the hyperlinks found in encyclopedia text are used to exemplify those instances of hyperlinks that exist among topically related entities (Ghani et al., 2001; Rao and Turoff, 1990).</Paragraph>
<Paragraph position="1"> Each Wikipedia page is identified with a unique ID. These IDs are formed by concatenating the words of the titles of the Wikipedia pages which are unique for each page, e.g., the page on Vincent van Gogh has &quot;Vincent van Gogh&quot; as its title and &quot;Vincent van Gogh&quot; as its ID. Each page may, however, be represented by different anchor texts in a hyperlink. The anchor texts may be simple morphological variants of the title, such as the plural form, or may represent a closely related semantic concept. For example, the anchor text &quot;Dutch&quot; may point to the page for the Netherlands. In a sense, the IDs function as the canonical form for several related concepts.</Paragraph>
</Section>
<Section position="2" start_page="63" end_page="63" type="sub_section">
<SectionTitle> 3.2 Links across different languages </SectionTitle>
<Paragraph position="0"> Different versions of a page in different languages are also hyperlinked. For a given page, translations of its title in other languages for which pages exist are given as hyperlinks. This property is particularly useful for the current task as it helps us to align the corpus at the page level. Furthermore, it also allows us to induce a bilingual lexicon consisting of the Wikipedia titles. Conceptual mismatch between the pages (e.g.
Roof vs Dakconstructie) is rare, and the lexicon is generally of high quality. Unlike the general lexicon, this lexicon contains a relatively large number of names of individuals and other entities which are highly informative and hence are useful in identifying similar text. This lexicon will form the backbone of one of the methods for identifying similar text across different languages, as will be shown in Section 4.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="63" end_page="66" type="metho">
<SectionTitle> 4 Approaches </SectionTitle>
<Paragraph position="0"> We describe two approaches for identifying similar sentences across different languages. The first uses an MT system to obtain a rough translation of a given page in one language into another and then uses word overlap between sentences as a similarity measure. One advantage of this method is that it relies on a large lexical resource which is bigger than what can be extracted from Wikipedia. However, the translation can be less accurate, especially for the Wikipedia titles, which form part of the content of a page and are very informative. The second approach relies on a bilingual lexicon which is generated from Wikipedia using the link structure: pages on the same topic in different languages are hyperlinked; see Figure 2. We use the titles of the pages that are linked in this manner to create a bilingual lexicon. Thus, our bilingual lexicon consists of terms that represent concepts or entities that have entries in Wikipedia, and we will represent sentences by entries from this lexicon: an entry is used to represent the content of a sentence if the sentence contains a hypertext link to the Wikipedia page for that entry.</Paragraph>
<Paragraph position="1"> Sentence similarity is then captured in terms of the lexicon entries the sentences share. In other words, the similarity measure that we use in this approach is based on &quot;concept&quot; or &quot;page title&quot; overlap.
Intuitively, this approach has the advantage of producing a brief but highly accurate representation of sentences, more accurate, we assume, than the MT approach as the titles carry important semantic information; it will also be more accurate than the MT approach because the translations of the titles are done manually.</Paragraph>
<Paragraph position="2"> Figure 2: Links to pages devoted to the same topic in other languages.</Paragraph>
<Paragraph position="3"> Both approaches assume that the Wikipedia corpus is aligned at the page level. This is easily achieved using the link structure since, again, pages on the same topic in different languages are hyperlinked. This, in turn, narrows down the search for similar text to the page level. Hence, for a given text of a page (sentence or chunk) in one language, we search for its equivalent text (sentence or chunk) only in the corresponding page in the other language, not in the entire corpus.</Paragraph>
<Paragraph position="4"> We now describe the two approaches in more detail. To remain focused and avoid getting lost in technical details, we consider only two languages in our technical descriptions and evaluations below: Dutch and English; it will be clear from our presentation, however, that our second approach can be used for any pair of languages in Wikipedia.</Paragraph>
<Section position="1" start_page="64" end_page="64" type="sub_section">
<SectionTitle> 4.1 An MT based approach </SectionTitle>
<Paragraph position="0"> In this approach, we translate the Dutch Wikipedia page into English using an online MT system. We refer to the English page as source and the translated (Dutch page) version as target. We used the Babelfish MT system of Altavista. It supports a number of language pairs among which are Dutch-English pairs. After both pages have been made available in English, we split the pages into sentences or text chunks. We then link each text chunk or sentence in the source to each chunk or sentence in the target. Following this we compute a simple word overlap score for each pair.
We used the Jaccard similarity measure for this purpose. Content words are our main features for the computation of similarity; hence, we remove stopwords. Grammatically correct translations may not be necessary since we are using simple word overlap as our similarity measure.</Paragraph>
<Paragraph position="1"> The above procedure will generate a large set of pairs, not all of which will actually be similar.</Paragraph>
<Paragraph position="2"> Filtering works as follows. First we sort the pairs in decreasing order of their similarity scores.</Paragraph>
<Paragraph position="3"> This results in a ranked list of text pairs in which the most similar pairs are ranked top whereas the least similar pairs are ranked bottom. Next we take the top ranking pair. Since we are assuming a one-to-one correspondence, we remove all other pairs ranked lower in the list containing either of the sentences or text chunks in the top ranking pair. We then repeat this process taking the second top ranking pair. Each step results in a smaller list.</Paragraph>
<Paragraph position="4"> The process continues until there are no more pairs to remove.</Paragraph>
</Section>
<Section position="2" start_page="64" end_page="66" type="sub_section">
<SectionTitle> 4.2 Using a link-based bilingual lexicon </SectionTitle>
<Paragraph position="0"> As mentioned previously, this approach makes use of a bilingual lexicon that is generated from Wikipedia using the link structure. A high level description of the algorithm is given in Figure 3.</Paragraph>
<Paragraph position="1"> Below, we first describe how the bilingual lexicon is acquired and how it is used for enriching the link structure of Wikipedia. Finally, we detail how the [...] content words from the general vocabulary as features, in this approach, we use page titles and their translations (as obtained through hyperlinks as explained above) as our primitives for the computation of multilingual similarity. The first step of this approach, then, is acquiring the bilingual lexicon, but this is relatively straightforward.
For each Wikipedia page in one language, translations of the title in other languages, for which there are separate entries, are given as hyperlinks. This information is used to generate a bilingual translation lexicon. Most of these titles are content bearing noun phrases and are very useful in multilingual similarity computation (Kirk Evans, 2005).</Paragraph>
<Paragraph position="2"> Most of these noun phrases are already disambiguated, and may consist of either a single word or multiword units.</Paragraph>
<Paragraph position="3"> Wikipedia uses a redirection facility to map several titles into a canonical form. These titles are mostly synonymous expressions. We used Wikipedia's redirect feature to identify synonymous expressions. Canonical representation of a sentence. Once we have the bilingual lexicon, the next step is to represent the sentences in both languages using this lexicon. Each sentence is represented by the set of hyperlinks it contains. We look up each hyperlink in the bilingual lexicon. If it is found, we replace the hyperlink with the corresponding unique identification of the bilingual lexicon entry.</Paragraph>
<Paragraph position="4"> If it is not found, the hyperlink will be included as is as part of the representation. This is done since Dutch and English are closely related languages and may share many cognate pairs.</Paragraph>
<Paragraph position="5"> Enriching the Wikipedia link structure. As described in the previous section, the method uses hyperlinks in a sentence as a highly focused entity-based representation of the aboutness of the sentence.
In Wikipedia, not all occurrences of named-entities or concepts that have entries in Wikipedia are actually used as anchor text of a hypertext link; because of this, a number of sentences may needlessly be left out from the similarity computation process. In order to avoid this problem, we automatically identify other relevant hyperlinks using the bilingual lexicon generated in the previous section.</Paragraph>
<Paragraph position="6"> Identification of additional hyperlinks in Wikipedia sentences works as follows. First we split the sentences into constituent words.</Paragraph>
<Paragraph position="7"> We then generate N gram words keeping the relative order of words in the sentences. Since the anchor texts of hypertext links may be multiword expressions, we start with higher order N gram words (N=4). We look up these N grams in the bilingual lexicon. If the N gram is found in the lexicon, it is taken as a new hyperlink and will form part of the representation of a sentence. The process is repeated for lower order N grams.</Paragraph>
<Paragraph position="8"> Identifying similar sentences. Once we are done representing the sentences as described previously, the final step involves computation of the term overlap between the sentence pairs and filtering the resulting list. The remaining steps are similar to those described in the MT based approach. For completeness, we briefly repeat the steps here. First, all sentences from a Dutch Wikipedia page are linked to all sentences of the corresponding English Wikipedia page. We then compute the similarity between the sentence representations, using the Jaccard similarity coefficient. A sentence in a Dutch page may be similar to several sentences in the English page, which may result in a large number of spurious pairs. Therefore, we filter the list using the following recursive procedure. First, the sentence pairs are sorted by their similarity scores. We take the pair with the highest similarity score. We then eliminate all other sentence pairs from the list that contain either of the sentences in this pair.
We continue this process taking the second highest ranking pair. Note that this procedure assumes a one-to-one matching rule; a sentence in Dutch can be linked to at most one sentence in English.</Paragraph>
</Section>
</Section>
</Paper>
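<!--
The scoring and filtering steps of Section 4.2 can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (jaccard, canonicalize, align), the toy lexicon and the toy sentences are hypothetical. Each sentence is assumed to be already reduced to the set of hyperlink targets it contains, and the lexicon is assumed to map Dutch titles to English IDs. Dropping pairs with no overlap and breaking ties by position are further assumptions that the text does not specify.

```python
from itertools import product


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canonicalize(links, lexicon):
    """Replace each link by its lexicon ID; links not in the lexicon are kept as is."""
    return frozenset(lexicon.get(link, link) for link in links)


def align(source, target, lexicon):
    """Score all cross-page sentence pairs, then greedily keep one-to-one matches."""
    scored = []
    for (i, s), (j, t) in product(enumerate(source), enumerate(target)):
        score = jaccard(canonicalize(s, lexicon), canonicalize(t, lexicon))
        if score > 0.0:  # assumption: pairs without any overlap are dropped
            scored.append((score, i, j))
    # Highest score first; ties are broken by position (an assumption).
    scored.sort(key=lambda p: (-p[0], p[1], p[2]))
    used_src, used_tgt, kept = set(), set(), []
    for score, i, j in scored:
        # One-to-one rule: skip pairs that reuse an already matched sentence.
        if i not in used_src and j not in used_tgt:
            kept.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return kept


# Toy data (hypothetical): link sets for two Dutch and two English sentences.
lexicon = {"Nederland": "Netherlands", "Schilder": "Painter"}
dutch = [{"Vincent van Gogh", "Schilder"}, {"Nederland"}]
english = [{"Netherlands", "Amsterdam"}, {"Vincent van Gogh", "Painter"}]
pairs = align(dutch, english, lexicon)
```

A single pass over the sorted list has the same effect as repeatedly taking the top pair and deleting its competitors, because a pair is kept only if neither of its sentences was matched earlier. In the toy data the untranslated title "Vincent van Gogh" still contributes to the overlap, which mirrors the decision to keep links that are missing from the lexicon.
-->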