<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2025">
  <Title>Bilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Proposed Translation Model in
CLIR
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the overall design of the proposed translation model in CLIR, which consists of three main parts: - Bilingual terminology acquisition from bi-directional comparable corpora, completed through a two-stage term-by-term translation model.</Paragraph>
    <Paragraph position="1"> - Linguistics-based pruning, which is applied to the extracted translation alternatives in order to filter them and detect terms and translations that are morphologically close, i.e., that have identical or similar part-of-speech tags.</Paragraph>
    <Paragraph position="2"> - Phrasal translation, completed on the source query after re-scoring the translation alternatives related to each source query term. The proposed re-scoring techniques are based on the World Wide Web (WWW), a large-scale test collection such as NTCIR, the comparable corpora or a possible interaction with the user, among others.</Paragraph>
    <Paragraph position="3"> Finally, a linear combination with bilingual dictionaries, bilingual thesauri and transliteration (for the special phonetic alphabet used to write foreign words and loanwords) would be possible, depending on the cost and availability of linguistic resources.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Two-stage Comparable Corpora-based
Approach
</SectionTitle>
      <Paragraph position="0"> The proposed two-stage approach to bilingual terminology acquisition and disambiguation from comparable corpora (Sadat et al., 2003) is described as follows: - Bilingual terminology acquisition from source language to target language to yield a first translation model, represented by similarity vectors $SIM_{S \rightarrow T}$.</Paragraph>
      <Paragraph position="1"> - Bilingual terminology acquisition from target language to source language to yield a second translation model, represented by similarity vectors $SIM_{T \rightarrow S}$.</Paragraph>
      <Paragraph position="2"> - Merge the first and second models to yield a two-stage translation model, based on bi-directional comparable corpora and represented by similarity vectors $SIM_{S \leftrightarrow T}$.</Paragraph>
      <Paragraph position="3"> We follow the strategies of previous research (Dejean et al., 2002; Fung, 2000; Rapp, 1999) for the first and second models and propose a merging and disambiguation process for the two-stage translation model. First, context vectors of each term in the source and target languages are constructed following a statistics-based metric. Next, the context vectors related to source words are translated using a preliminary bilingual seed lexicon. Finally, similarity vectors $SIM_{S \rightarrow T}$ and $SIM_{T \rightarrow S}$, related to the first and second models respectively, are constructed for each pair of source term and target translation using the cosine metric.</Paragraph>
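The vector comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: context vectors are assumed to be sparse dictionaries from context word to weight, the seed lexicon is assumed to map a source word to a list of target words, and the function names `translate_vector` and `cosine` as well as all toy data are hypothetical.

```python
import math


def translate_vector(vector, seed_lexicon):
    """Project a source-language context vector into the target language.

    Assumption: a context word with several lexicon entries passes its full
    weight to each of them; words missing from the lexicon are dropped.
    """
    translated = {}
    for word, weight in vector.items():
        for target_word in seed_lexicon.get(word, []):
            translated[target_word] = translated.get(target_word, 0.0) + weight
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (hypothetical): one source term and two target candidates.
seed_lexicon = {"ginkou": ["bank"], "kinri": ["interest", "rate"]}
source_context = {"ginkou": 2.0, "kinri": 1.0, "kyou": 0.5}
target_contexts = {
    "loan": {"bank": 2.0, "interest": 1.0, "rate": 1.0},
    "river": {"water": 3.0, "bank": 0.5},
}

projected = translate_vector(source_context, seed_lexicon)
sim_s_to_t = {t: cosine(projected, ctx) for t, ctx in target_contexts.items()}
```

In this toy setting the candidate whose context overlaps most with the projected source vector receives the higher similarity. The reverse model would be obtained in the same way with the roles of the two languages exchanged.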
      <Paragraph position="4"> The merging process keeps the pairs of source term and target translation (s,t) that appear both in $SIM_{S \rightarrow T}$ as (s,t) and in $SIM_{T \rightarrow S}$ as (t,s), resulting in combined similarity vectors $SIM_{S \leftrightarrow T}$ for each pair (s,t). The similarity value of a pair (s,t) in $SIM_{S \leftrightarrow T}$ is the product of its similarity values in $SIM_{S \rightarrow T}$ and $SIM_{T \rightarrow S}$.</Paragraph>
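The merging step can be illustrated with a short sketch. It assumes that each directional model is stored as a dictionary keyed by term pairs; the function name `merge_models` and the toy similarity values are hypothetical.

```python
def merge_models(sim_s_to_t, sim_t_to_s):
    """Keep pairs found in both directions and multiply their similarities.

    sim_s_to_t maps (source, target) to a similarity value;
    sim_t_to_s maps (target, source) to a similarity value.
    """
    merged = {}
    for (s, t), forward in sim_s_to_t.items():
        backward = sim_t_to_s.get((t, s))
        if backward is not None:
            merged[(s, t)] = forward * backward
    return merged


# Toy similarity values (hypothetical).
sim_s_to_t = {("s1", "t1"): 0.8, ("s1", "t2"): 0.6}
sim_t_to_s = {("t1", "s1"): 0.5, ("t3", "s1"): 0.9}

sim_merged = merge_models(sim_s_to_t, sim_t_to_s)
```

Only the pair (s1, t1) survives here, because it is the only one proposed by both directional models; pairs seen in a single direction are discarded.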
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Linguistics-based Pruning
</SectionTitle>
      <Paragraph position="0"> Morphological knowledge, such as part-of-speech (POS) tags and the context of terms extracted from thesauri, could be valuable for filtering and pruning the extracted translation candidates. POS tags are assigned to each source term (Japanese) via morphological analysis.</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> Likewise, a target-language morphological analysis assigns POS tags to the translation candidates. We restricted the pruning technique to nouns, verbs, adjectives and adverbs, although other POS tags could be treated in a similar way. For the Japanese-English language pair, Japanese nouns and verbs are compared to English nouns and verbs, respectively. Japanese adverbs and adjectives are compared to English adverbs and adjectives, because of the close relationship between adverbs and adjectives in Japanese (Sadat et al., 2003).</Paragraph>
      <Paragraph position="5"> Finally, the generated translation alternatives are sorted in decreasing order by similarity values and rank counts are assigned in increasing order. A fixed number of top-ranked translation alternatives are selected and misleading candidates are discarded.</Paragraph>
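A minimal sketch of the pruning and ranking steps is given below. Several points are assumptions made for illustration: adjectives and adverbs are treated as one comparison group, source terms outside the four POS classes are left unpruned, and the names `POS_GROUP` and `prune_and_rank`, the tag labels and the toy candidates are all hypothetical.

```python
# Assumption: adjectives and adverbs share one comparison group.
POS_GROUP = {
    "noun": "noun",
    "verb": "verb",
    "adjective": "modifier",
    "adverb": "modifier",
}


def prune_and_rank(source_pos, candidates, top_k=2):
    """Filter candidates by POS group, then rank them by similarity.

    candidates is a list of (translation, pos, similarity) tuples.
    Returns (rank, translation, similarity) tuples, rank starting at 1.
    """
    group = POS_GROUP.get(source_pos)
    if group is None:
        # Assumption: source terms outside the four POS classes are not pruned.
        kept = list(candidates)
    else:
        kept = [c for c in candidates if POS_GROUP.get(c[1]) == group]
    kept.sort(key=lambda c: c[2], reverse=True)
    return [(rank, c[0], c[2]) for rank, c in enumerate(kept[:top_k], start=1)]


# Toy candidates (hypothetical) for a source term tagged as an adjective.
candidates = [
    ("quick", "adjective", 0.41),
    ("speed", "noun", 0.55),
    ("quickly", "adverb", 0.47),
    ("fast", "adjective", 0.30),
]
ranked = prune_and_rank("adjective", candidates, top_k=2)
```

The noun candidate is discarded despite its high similarity value, and the remaining candidates are ranked so that rank 1 carries the highest similarity.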
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Phrasal Translation
</SectionTitle>
      <Paragraph position="0"> Query translation ambiguity can be drastically mitigated by considering the query as a phrase and restricting the single term translation to those candidates that were selected by the proposed combined statistics-based and linguistics-based approach (Sadat et al., 2003). Therefore, after generating a ranked list of translation candidates for each source term, re-scoring techniques are proposed to estimate the coherence of the translated query and decide the best phrasal translation.</Paragraph>
      <Paragraph position="1"> Assume a source query Q having n terms $\{s_1, \ldots, s_n\}$. Phrasal translation of the source query Q is completed according to the selected top-ranked translation alternatives for each source term $s_i$ and a re-scoring factor $RF_k$, as follows:</Paragraph>
      <Paragraph position="3"> Here, $Q_k(s_1 \ldots s_n)$ represents the phrasal translation candidate associated with rank k. The re-scoring factor $RF_k(t_1 \ldots t_n; s_1 \ldots s_n)$ is estimated using one of the re-scoring techniques described below.</Paragraph>
      <Paragraph position="4"> Re-scoring through the WWW. The WWW can be considered as an exemplary linguistic resource for decision-making (Grefenstette, 1999). In the present study, the WWW is exploited in order to re-score the set of translation candidates related to the source terms.</Paragraph>
      <Paragraph position="5"> Sequences of all possible combinations are constructed between elements of the sets of highly ranked translation alternatives. Each sequence is sent to a popular Web portal (here, Google) to discover how often the combination of translation alternatives appears. The number of retrieved WWW pages in which the translated sequence occurs is used as the re-scoring factor RF for each sequence of translation candidates. Phrasal translation candidates are sorted in decreasing order of their re-scoring factors RF.</Paragraph>
      <Paragraph position="6"> Finally, a number (thres) of highly ranked phrasal translation sequences is selected and collated into the final phrasal translation.</Paragraph>
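The re-scoring procedure can be sketched as follows. No real search engine is queried: the page count is supplied by a caller-provided function, here replaced by a small dictionary of invented counts. The function name `rescore`, the parameter `hit_count` and the toy data are hypothetical, and joining the candidates with a space is an assumption about how a sequence is formed.

```python
from itertools import product


def rescore(candidate_lists, hit_count, thres=2):
    """Rank every combination of per-term candidates by its hit count.

    candidate_lists holds one list of translation candidates per query term.
    hit_count is any callable returning the number of pages (or documents)
    that contain the given sequence; here it stands in for a Web search.
    """
    sequences = [" ".join(combo) for combo in product(*candidate_lists)]
    scored = [(sequence, hit_count(sequence)) for sequence in sequences]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:thres]


# Toy page counts (hypothetical) in place of a real search engine.
toy_counts = {
    "interest rate": 900,
    "interest ratio": 40,
    "profit rate": 120,
    "profit ratio": 15,
}
best = rescore([["interest", "profit"], ["rate", "ratio"]], toy_counts.get, thres=2)
```

Passing the counting function as a parameter keeps the sketch independent of any particular resource, so the same ranking logic applies whichever source supplies the counts.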
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Re-scoring through a Test Collection
</SectionTitle>
      <Paragraph position="0"> Large-scale test collections could be used to re-score the translation alternatives and complete a phrasal translation. We follow the same steps as the WWW-based technique, replacing the WWW with a test collection and a retrieval system that indexes the documents of the test collection.</Paragraph>
      <Paragraph position="1"> The NTCIR test collection (Kando, 2001) could be a good alternative for the Japanese-English language pair, especially if it is involved in the comparable corpora.</Paragraph>
      <Paragraph position="2"> Re-scoring through the Comparable Corpora. Comparable corpora could be considered for the disambiguation of translation alternatives and thus the selection of the best phrasal translations (Sadat et al., 2002). Our proposed algorithm to estimate the re-scoring factor RF relies on the source and target language parts of the comparable corpora, using statistics-based measures. Co-occurrence tendencies are estimated for each pair of source terms using the source language text and for each pair of translation alternatives using the target language text. Re-scoring through an Interactive Mode. An interactive mode (Ogden and Davis, 2000) could help solve the problem of phrasal translation.</Paragraph>
      <Paragraph position="3"> The interactive environment setting should optimize the phrasal translation, select the best phrasal translation alternatives and facilitate information access across languages. For instance, the user can access a list of all possible phrases, ranked in a hierarchy on the basis of the word ranks associated with each translation alternative. Selecting a phrase will modify the ranked list of phrases and provide access to the documents related to that phrase.</Paragraph>
    </Section>
  </Section>
</Paper>