<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2810">
  <Title>Finding Similar Sentences across Multiple Languages in Wikipedia</Title>
  <Section position="6" start_page="66" end_page="67" type="evalu">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> Now that we have described the two algorithms for identifying similar sentences, we return to our research questions. To answer them, we run the experiment described below.</Paragraph>
    <Section position="1" start_page="66" end_page="66" type="sub_section">
      <SectionTitle>
5.1 Set-up
</SectionTitle>
      <Paragraph position="0"> We took a random sample of 30 English-Dutch Wikipedia page pairs. Each page is split into sentences. We generated candidate Dutch-English sentence pairs and passed them on to the two methods. Both methods return a ranked list of sentence pairs that are similar. As explained above, we assumed a one-to-one correspondence, i.e., one English sentence can be linked to at most one Dutch sentence.</Paragraph>
      <Paragraph position="1"> The outputs of the systems are manually evaluated. We apply a relatively lenient criterion in assessing the results. If two sentences overlap in terms of their information content, then we consider them to be similar. This includes cases in which the sentences are exact translations of each other, one sentence is contained within the other, or both share some bits of information.</Paragraph>
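      <Paragraph position="2"> The one-to-one correspondence described in this set-up can be illustrated with a minimal sketch. It assumes a greedy pass over score-sorted candidates and applies the constraint in both directions; the section does not say how the constraint is enforced, so the function name one_to_one, the sentence ids, and the scores below are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (english_id, dutch_id, score); ids and scores are illustrative, not from the paper.
Candidate = Tuple[str, str, float]


def one_to_one(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Greedily keep the best-scoring pairs so that no sentence is linked twice."""
    used_en, used_nl = set(), set()
    kept = []
    # Highest score first; sorted() is stable, so ties keep their input order.
    for en, nl, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if en in used_en or nl in used_nl:
            continue  # one side is already linked to a better-scoring partner
        used_en.add(en)
        used_nl.add(nl)
        kept.append((en, nl, score))
    return kept


candidates = [
    ("en1", "nl1", 0.9),
    ("en1", "nl2", 0.8),
    ("en2", "nl1", 0.7),
    ("en2", "nl3", 0.4),
]
ranked = one_to_one(candidates)
print(ranked)
```

      A greedy pass is only one possible choice: it is simple and keeps the output ranked, but it may discard a pair that a globally optimal assignment would keep.</Paragraph>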
    </Section>
    <Section position="2" start_page="66" end_page="67" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the results of the two methods described in Section 4. In the table, we give two types of numbers for each of the two methods, MT and Bilingual lexicon: Total (the total number of sentence pairs) and Match (the number of correctly identified sentence pairs) generated by the two approaches.</Paragraph>
      <Paragraph position="1"> Overall, the two approaches tend to produce similar numbers of correctly identified similar sentence pairs. The systems seem to perform well on pages which tend to be alignable at sentence level, i.e., parallel. This is clearly seen on the following pages: Pierluigi Collina, Marcus Cornelius Fronto, and George F. Kennan, which show a high similarity at sentence level. Some pages contain very short descriptions, and hence the figures for correct similar sentences are also small. Other topics such as Classicism (Dutch: Classicisme), Tennis, and Tank, though described in sufficient detail in both languages, tend to have less overlap between the texts. The methods tend to retrieve more accurate similar pairs from person pages than from other pages, especially those describing more abstract concepts. However, this needs to be tested more thoroughly.</Paragraph>
      <Paragraph position="2"> When we look at the total number of sentence pairs returned, we notice that the bilingual lexicon based method consistently returns a smaller number of similar sentence pairs, which makes the method more accurate than the MT based approach. On average, the MT based approach returns 4.5 (26%) correct sentences and the bilingual lexicon based approach returns 2.9 correct sentences (45%). But, on average, the MT approach returns three times as many sentence pairs as the bilingual lexicon approach. This may be due to the fact that the former makes use of a restricted set of important terms or concepts whereas the latter uses a large general lexicon. Though we remove some of the most frequently occurring stopwords in the MT based approach, it still generates a large number of incorrect similar sentence pairs due to some common words.</Paragraph>
      <Paragraph position="3"> In general, the number of correctly identified similar sentence pairs extracted seems small. However, most of the Dutch pages are relatively small, which sets the upper bound on the number of correctly identified sentence pairs that can be extracted. On average, each Dutch Wikipedia page in the sample contains 18 sentences whereas English Wikipedia pages contain 65 sentences. Excluding the pages for Tennis, Tank (Dutch: voertuig), and Tricolor, which are relatively large, each Dutch page contains on average 8 sentences, which is even smaller. Given the fact that the pages are in general not parallel, the methods, using simple heuristics, identified high quality translation equivalent sentence pairs from most Wikipedia pages. Furthermore, a close examination of the output of the two approaches shows that both tend to identify the same set of similar sentence pairs. We ran our bilingual lexicon based approach on the whole Dutch-English Wikipedia corpus. The method returned about 80M candidate similar sentences. Though we do not have the resources to evaluate this output, the results we got from sample data (cf. Table 1) suggest that it contains a significant number of correctly identified similar sentences. [Table 1 caption fragment, beginning missing in the source:] ... of correctly identified similar sentence pairs (column 4) returned by the MT based approach. The total number of sentence pairs (column 5) and the number of correctly identified similar sentence pairs (column 6) returned by the method using a bilingual lexicon.</Paragraph>
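      <Paragraph position="4"> The averages reported in this section for the two methods (4.5 correct pairs at 26% for MT, 2.9 correct pairs at 45% for the bilingual lexicon) can be checked against the statement that the MT approach returns about three times as many pairs. The sketch below assumes that each percentage is the average Match divided by the average Total; if the percentages were instead averaged per page, the implied totals are only approximate. The helper name implied_total is hypothetical.

```python
def implied_total(avg_match: float, precision: float) -> float:
    """Average number of returned pairs implied by precision = match / total."""
    return avg_match / precision


# Figures quoted in the text: average correct pairs and the share that is correct.
mt_total = implied_total(4.5, 0.26)
lexicon_total = implied_total(2.9, 0.45)
ratio = mt_total / lexicon_total

print(round(mt_total, 1), round(lexicon_total, 1), round(ratio, 1))
# prints: 17.3 6.4 2.7
```

      Under this assumption the MT approach returns roughly 17 pairs per page against roughly 6 for the bilingual lexicon, a ratio of about 2.7, which is consistent with the reported factor of three.</Paragraph>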
    </Section>
  </Section>
</Paper>