<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1008">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics A fast and accurate method for detecting English-Japanese parallel texts</Title>
  <Section position="7" start_page="65" end_page="66" type="concl">
    <SectionTitle>
5 Summary and Future Work
</SectionTitle>
    <Paragraph position="0"> In this paper, we proposed a fast and accurate method for detecting parallel texts in a collection. The method consists of three major parts: preprocessing a bilingual dictionary into word-ID conversion rules, converting texts into ID sequences, and comparing the sequences. With this method, we achieved a throughput of 250,000 pairs/sec on a single CPU and a best F1 score of 0.960. In addition, the method uses only the linguistic information of the textual content, so it is generally applicable.</Paragraph>
    <Paragraph position="1"> This means it can detect parallel documents in any format. Furthermore, our method is essentially language-independent. It can be applied to any pair of languages if a bilingual dictionary between the languages is available (a general language dictionary suffices). Our future work will include improving both accuracy and speed while retaining this generality. For accuracy, as we described in Section 4.5, tscore tends to increase when an identical semantic ID appears many times in a text. We might be able to deal with this problem by taking into account the probability that the distance between words is within a threshold. Large connected components were partitioned by a very simple method in the present work. More involved partitioning methods may improve the accuracy of the judgement. For speed, reducing the number of comparisons is the most important issue that needs to be addressed.</Paragraph>
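    <Paragraph position="2"> The three parts summarized in this section can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function names, the toy dictionary, and the assumption of pre-tokenized input are all hypothetical, and the comparison step uses a plain Jaccard overlap of ID sets as a stand-in for the paper's actual scoring.

```python
from typing import Dict, Iterable, List, Tuple


def build_conversion_rules(entries: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Part 1 (assumed form): turn dictionary entries into word-to-ID rules.

    Simplification: each (english, japanese) entry proposes a fresh ID, and a
    word already seen in an earlier entry keeps the ID it was first given.
    """
    rules: Dict[str, int] = {}
    for next_id, (en_word, ja_word) in enumerate(entries):
        shared = rules.get(en_word, rules.get(ja_word, next_id))
        rules.setdefault(en_word, shared)
        rules.setdefault(ja_word, shared)
    return rules


def to_id_sequence(tokens: Iterable[str], rules: Dict[str, int]) -> List[int]:
    """Part 2: replace tokens by IDs, dropping words the dictionary lacks."""
    return [rules[token] for token in tokens if token in rules]


def overlap_score(ids_a: List[int], ids_b: List[int]) -> float:
    """Part 3 (stand-in, not the paper's score): Jaccard overlap of ID sets."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a.intersection(set_b)) / len(union)


# Toy data, invented for illustration only.
rules = build_conversion_rules([("dog", "犬"), ("cat", "猫"), ("eat", "食べる")])
english_ids = to_id_sequence(["the", "dog", "eat"], rules)
japanese_ids = to_id_sequence(["犬", "が", "食べる"], rules)
print(overlap_score(english_ids, japanese_ids))
```

    Because both texts are reduced to sequences of integers before comparison, the comparison step needs no language-specific processing, which is the property the generality claim above relies on.</Paragraph>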
  </Section>
</Paper>