File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1149_metho.xml

Size: 18,069 bytes

Last Modified: 2025-10-06 14:08:47

<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1149">
  <Title>Integrating Cross-Lingually Relevant News Articles and Monolingual Web Documents in Bilingual Lexicon Acquisition Takehito Utsuro +</Title>
  <Section position="4" start_page="3" end_page="910" type="metho">
    <SectionTitle>
EJ
</SectionTitle>
    <Paragraph position="0"> First, from a pseudo-parallel sentence pair d</Paragraph>
    <Paragraph position="2"> we extract monolingual (possibly compound</Paragraph>
    <Paragraph position="4"> Then, based on the contingency table of co-occurrence document frequencies of t</Paragraph>
    <Paragraph position="6"> below, we estimate bilingual term correspondences according to statistical measures such as the mutual information, the φ² statistic, the Dice coefficient, and the log-likelihood ratio.</Paragraph>
    <Paragraph position="8"> We compare the performance of those four measures: the φ² statistic and the log-likelihood ratio perform best, the Dice coefficient second best, and the mutual information worst. In Section 4.3, we show results with the φ² statistic as the bilingual term correspondence</Paragraph>
    <Paragraph position="10"/>
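    <Paragraph position="11"> As a rough illustration of the four measures named above, the following Python sketch computes them from a 2x2 contingency table of document frequencies. The cell names a, b, c, d, the function name, and the choice of pointwise mutual information and Dunning's log-likelihood ratio are assumptions made for this illustration, since the paper's own formulas did not survive in this text. The sketch also assumes that all marginal totals are non-zero.</Paragraph>
    <Paragraph position="12"><![CDATA[
```python
import math


def association_scores(a, b, c, d):
    """Score one English/Japanese term pair from a 2x2 contingency table.

    Assumed cell layout (hypothetical notation, not taken from the paper):
      a: documents containing both terms
      b: documents containing only the English term
      c: documents containing only the Japanese term
      d: documents containing neither term
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d  # marginals for the English term
    col1, col2 = a + c, b + d  # marginals for the Japanese term

    # Pointwise mutual information of the co-occurrence cell.
    mi = math.log2(a * n / (row1 * col1)) if a > 0 else float("-inf")

    # phi-squared statistic: squared correlation of the two indicator variables.
    phi2 = (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

    # Dice coefficient.
    dice = 2 * a / (row1 + col1)

    # Log-likelihood ratio (G^2): 2 * sum(observed * ln(observed / expected)).
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    llr = 2 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)

    return {"mi": mi, "phi2": phi2, "dice": dice, "llr": llr}
```
]]></Paragraph>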
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> This section gives an overview of the process of re-estimating bilingual term correspondences using monolingual Web documents collected by search engines. Figure 2 illustrates the idea.</Paragraph>
      <Paragraph position="1"> has been well studied in the context of bilingual lexicon acquisition from comparable corpora. In this method, we regard cross-lingually relevant texts as a comparable corpus, where bilingual term correspondences are estimated in terms of contextual similarities across languages. This technique is less effective than the one we describe here (Utsuro et al., 2003).</Paragraph>
      <Paragraph position="2">  In the evaluation of this paper, we restrict English and Japanese terms t</Paragraph>
      <Paragraph position="4"> to be up to five words long. Suppose that we have an English term and want to find its Japanese translation. As described in the previous section and in Figure 1, a database of cross-lingually relevant Japanese and English news articles yields a certain number of Japanese translation candidates for the target English term. For high-frequency terms, it is relatively easy to rank those candidates reliably. For low-frequency terms, however, reliable ranking is difficult. This low-frequency problem often arises when the available language resources (in this case, cross-lingually relevant news articles) are not large enough. Considering such a situation, re-estimation of bilingual term correspondences proceeds as follows, using much larger sets of monolingual Web documents that are easily accessible through search engines. First, English pages that contain the target English term are collected through an English search engine. Similarly, for each Japanese term among the translation candidates, Japanese pages that contain the term are collected through a Japanese search engine. Then, the texts contained in those English and Japanese pages are extracted and regarded as comparable corpora. Here, a standard technique for estimating bilingual term correspondences from comparable corpora (e.g., Fung and Yee (1998) and Rapp (1999)) is employed. The contextual similarity between the target English term and each Japanese translation candidate is measured across languages, and all the Japanese translation candidates are re-ranked according to these similarities.</Paragraph>
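      <Paragraph position="5"> The re-ranking step described above can be sketched as follows. This is a minimal illustration that assumes cosine similarity over term-frequency context vectors and assumes that the English context has already been mapped into Japanese vocabulary; the similarity measure and the names cosine and rerank are hypothetical choices for this sketch, not details confirmed by the text.</Paragraph>
      <Paragraph position="6"><![CDATA[
```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(english_context, candidate_contexts):
    """Return candidates sorted by descending contextual similarity.

    english_context: Counter of context terms for the target English term,
        assumed to be already translated into the candidates' vocabulary.
    candidate_contexts: dict mapping each Japanese candidate to the Counter
        of context terms found in the pages retrieved for it.
    """
    scored = [(cosine(english_context, ctx), cand)
              for cand, ctx in candidate_contexts.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]
```
]]></Paragraph>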
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Filtering by Hits of Search Engines
</SectionTitle>
      <Paragraph position="0"> Before re-estimating bilingual term correspondences using monolingual Web documents, we assume that there is a certain correlation between the hits of the English term t</Paragraph>
      <Paragraph position="2"> returned by search engines. Depending on the hits h(t</Paragraph>
      <Paragraph position="4"> to be within the range of a lower bound</Paragraph>
      <Paragraph position="6"> As search engines, we used AltaVista (http://www.altavista.com/) for English and goo (http://www.goo.ne.jp/) for Japanese. With a development data set consisting of translation pairs of an English term and a Japanese term, we manually constructed the following rules for determining the lower  In the experimental evaluation of Section 4.4, the initial set of Japanese translation candidates consists of 50 terms for each English term, which this filtering reduces to 24.8 terms on average.</Paragraph>
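      <Paragraph position="7"> The filtering idea can be sketched as below. The actual rules for the bounds are not recoverable from this text, so example_bounds is a purely hypothetical placeholder, and the presence of an upper bound alongside the lower bound is itself an assumption drawn from the phrase "within the range". Only the overall shape, keeping a candidate when its hit count lies in a range that depends on the hits of the English term, follows the description.</Paragraph>
      <Paragraph position="8"><![CDATA[
```python
def example_bounds(english_hits):
    """Hypothetical rule, NOT the paper's: allow a factor of 100 either way."""
    return english_hits / 100, english_hits * 100


def filter_by_hits(english_hits, candidate_hits, bounds_for=example_bounds):
    """Keep the candidates whose hit counts fall inside the allowed range.

    english_hits: number of search-engine hits for the English term.
    candidate_hits: dict mapping each Japanese candidate to its hit count.
    bounds_for: function returning (lower, upper) for a given english_hits.
    """
    lower, upper = bounds_for(english_hits)
    return [term for term, hits in candidate_hits.items()
            if lower <= hits <= upper]
```
]]></Paragraph>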
    </Section>
    <Section position="3" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
3.3 Re-estimating Bilingual Term Correspondences based on Contextual Similarity
</SectionTitle>
      <Paragraph position="0"> This section describes how to re-estimate bilingual term correspondences using monolingual Web documents collected by search engines.</Paragraph>
      <Paragraph position="1"> For an English term t</Paragraph>
      <Paragraph position="3"> ) be the sets of documents returned by search engines with queries</Paragraph>
      <Paragraph position="5"> is constructed as below: each English sen- In the term frequency vectors, compound terms are restricted to at most five words for both English and Japanese.</Paragraph>
      <Paragraph position="6">  In translating English sentences into Japanese, we compared MT software with a bilingual lexicon in terms of the performance of re-estimating bilingual term correspondences. Unlike the case of cross-lingually relevant news articles mentioned in Section 2.2, translation by a bilingual lexicon is more effective for monolingual Web documents. With monolingual Web documents, closely related documents in the other language are much less likely to be found. In such cases, multiple translations, rather than the exact translation produced by MT software, are suitable. In Section 4.4, we show evaluation results with translation by a bilingual lexicon.</Paragraph>
    </Section>
    <Section position="4" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.1 Japanese-English Relevant News Articles on Web News Sites
</SectionTitle>
      <Paragraph position="0"> We collected Japanese and English news articles from a Web news site. Table 1 shows the total number of collected articles and the range of dates of those articles, expressed as a number of days. Table 1 also shows the number of articles updated in one day and the average article size. The number of Japanese articles updated in one day is far greater (about 4 times) than that of English articles.</Paragraph>
      <Paragraph position="1">  Next, for several lower bounds L_d of the similarity between English and Japanese articles, Table 2 shows the numbers of English and Japanese articles as well as article pairs that satisfy the similarity lower bound. Here, the difference between the dates of the English and Japanese articles is within two days, which guarantees that closely related articles in the other language, if they exist, can be discovered (see Utsuro et al. (2003) for details). Note that one article can have similarity values above the lower bound against more than one article in the other language.</Paragraph>
      <Paragraph position="2"> According to our previous study (Utsuro et al., 2003), cross-lingually relevant news articles are available in the direction of English-to-Japanese retrieval for more than half of the English query articles. Furthermore, with the similarity lower bound L_d = 0.3, the precision and recall of cross-language retrieval are around 30% and 60%, respectively. Therefore, with L_d = 0.3, at least 1,800 (≈ 6,073 × 0.5 × 0.6) English articles have relevant Japanese articles in the results of cross-language retrieval. Based on this analysis, the next section gives evaluation results with the similarity lower bound L_d = 0.3.</Paragraph>
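      <Paragraph position="3"> The estimate of 1,800 articles can be spelled out as follows. Reading 6,073 as the number of English query articles, 0.5 as the share of them that have a relevant Japanese article ("more than half"), and 0.6 as the recall of cross-language retrieval is an interpretation of the text above, not something it states explicitly.</Paragraph>
      <Paragraph position="4"><![CDATA[
```latex
\[
  6{,}073 \times 0.5 \times 0.6 \;=\; 3{,}036.5 \times 0.6 \;=\; 1{,}821.9 \;\approx\; 1{,}800
\]
```
]]></Paragraph>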
    </Section>
    <Section position="5" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
4.2 English Term List for Evaluation
</SectionTitle>
      <Paragraph position="0"> For the evaluation in this paper, we first manually select target English terms and their reference Japanese translations, and examine whether the reference bilingual term correspondences can be estimated by the methods presented in Sections 2 and 3. Target English terms are selected by the following procedure.</Paragraph>
      <Paragraph position="1"> First, from all the English articles of Table 1, every sequence of more than one word whose frequency is at least 10 is enumerated. This enumeration is easily implemented and efficiently computed with the PrefixSpan technique (Pei et al., 2001).</Paragraph>
      <Paragraph position="2"> Here, a certain portion of those word sequences are appropriate as compound terms, while the rest are fragments of a compound term or concatenations of such fragments. In order to automatically select candidates for correct compound terms, we parse those word sequences and collect noun phrases that consist of adjectives, nouns, and present/past participles. For each of those word sequences, the φ² statistic against Japanese translation candidates is calculated, and the word sequences are then sorted in descending order of their φ² statistic. Finally, among the top 3,000 candidates for compound terms, 100 English compound terms are randomly selected for the evaluation in this paper. The selected 100 terms satisfy the following condition: they can be correctly translated neither by the MT software used in Section 2.2 nor by the bilingual lexicon used in Section 3.3.</Paragraph>
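      <Paragraph position="3"> The enumeration step of this procedure can be approximated with a plain n-gram count, as in the sketch below. Note that this is a simple substitute and not PrefixSpan itself. It also assumes whitespace tokenization, counts occurrences rather than documents, and caps sequences at five words (the limit used in the evaluation); the function name and parameters are hypothetical.</Paragraph>
      <Paragraph position="4"><![CDATA[
```python
from collections import Counter


def frequent_word_sequences(sentences, min_freq=10, max_len=5):
    """Enumerate contiguous word sequences of 2..max_len words.

    Returns a dict mapping each sequence (as a tuple of words) to its
    occurrence count, keeping only sequences seen at least min_freq times.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_freq}
```
]]></Paragraph>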
    </Section>
    <Section position="6" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
4.3 Estimating Bilingual Term Correspondences with News Articles
</SectionTitle>
      <Paragraph position="0"> For the 100 English terms selected in the previous section, Japanese translation candidates that satisfy the condition of formula (2) in Section 2.3 are collected and ranked according to the φ² statistic. Figure 3 plots the rate at which the reference Japanese translation falls within the top n candidates. In the figure, the plot labeled "full" is the result with all the articles in Table 1. In this case, the accuracy of the top-ranked Japanese translation candidate is about 40%, and the rate of the reference Japanese translation being within the top five candidates is about 75%. The other plots, labeled "Freq=x, y days", are the results when the number of news articles is reduced, which simulates estimating bilingual term correspondences for low-frequency terms. Here, the label "Freq=x, y days" indicates that the news articles used for estimating the φ² statistic are restricted to a portion of the whole set so that the following conditions are satisfied: i) the co-occurrence document frequency of a target English term and its reference Japanese translation is fixed at x, and ii) the number of days is greater than or equal to y. For each news articles data set, Table 3 shows document  and the numbers of days for English as well as Japanese articles. Those numbers are all averaged over the 100 English terms. The number of days for Japanese articles can be at most five times larger than that for English articles, because relevant Japanese articles are retrieved against an English query article from dates that differ by at most two days (details are in Sections 2.2 and 4.1).</Paragraph>
      <Paragraph position="1"> As can be seen from the plots of Figure 3, the smaller the news articles data set, the lower the plot. In particular, for the smallest news articles data set, it is clear that reliable ranking of Japanese translation candidates is difficult. This is because it is not easy to discriminate between the reference Japanese translation and the other candidates with statistics obtained from such a small news articles data set.</Paragraph>
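      <Paragraph position="2"> The quantity plotted in Figure 3, the rate at which the reference translation appears within the top n candidates, can be computed as in the following sketch. The data layout (one ranked candidate list and one reference translation per English term) and the name top_n_rate are assumptions made for illustration.</Paragraph>
      <Paragraph position="3"><![CDATA[
```python
def top_n_rate(rankings, references, n):
    """Fraction of English terms whose reference is among the top n candidates.

    rankings: dict mapping each English term to its ranked candidate list,
        best candidate first.
    references: dict mapping each English term to its reference translation.
    """
    hits = sum(1 for term, ranked in rankings.items()
               if references[term] in ranked[:n])
    return hits / len(rankings)
```
]]></Paragraph>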
    </Section>
    <Section position="7" start_page="8" end_page="910" type="sub_section">
      <SectionTitle>
4.4 Re-estimating Bilingual Term Correspondences with Monolingual Web Documents
</SectionTitle>
      <Paragraph position="0"> For the 100 target English terms evaluated in the previous section, this section describes the result of applying the technique presented in Section 3.3, i.e., re-estimating bilingual term correspondences using monolingual Web documents. For each of the 100 target English terms, bilingual term correspondences are re-estimated against the Japanese translation candidates ranked within the top 50 according to the φ² statistic. Here, as a simulation for terms that are infrequent in news articles, the 50 candidate terms for Japanese translation are collected from the smallest data set, labeled "Freq=10, 13.6 days". As mentioned in Section 3.2, those 50 candidates are reduced to 24.8 terms on average by the filtering by hits of search engines. For each of an English term</Paragraph>
      <Paragraph position="2"> and a Japanese term t_J, 100 monolingual documents are collected by search engines.</Paragraph>
      <Paragraph position="3"> Figure 4 compares the plots of re-estimation with monolingual Web documents and estimation by news articles (data set "Freq=10, 13.6 days"). It is clear from this result that monolingual Web documents contribute to improving the accuracy of estimating bilingual term correspondences for low-frequency terms.</Paragraph>
      <Paragraph position="4"> In our preliminary evaluation, the accuracy of re-estimating bilingual term correspondences did not improve even when more than 100 documents were used. Alternatively, as the monolingual documents from which contextual vectors are constructed, we evaluated the short passages listed in the summary pages returned by search engines, instead of the whole documents at the URLs listed in those pages. The difference in the performance of bilingual term correspondence estimation is small, while the computational cost can be reduced to almost 5%.</Paragraph>
      <Paragraph position="5"> One of the major reasons for this improvement is that topics of monolingual Web documents collected through search engines are much more diverse than those of news articles. Such diverse topics help discriminate between correct and incorrect Japanese translation candidates. For example, suppose that the target English term t_E is "special anti-terrorism law" and its reference Japanese translation is "0f O". In the news articles we used for evaluation, most articles in which t  "dispatch of Self-Defense Force for reconstruction of Iraq" as their topics. Here, Japanese translation candidates other than "0f O" that are highly ranked according to the φ² statistic are, e.g., ":r(dissolution of the House of Representatives)" and " (assistance for reconstruction of Iraq)", which frequently appear in the topic of "dispatch of Self-Defense Force for reconstruction of Iraq".</Paragraph>
      <Paragraph position="6"> On the other hand, in the case of monolingual Web documents collected through search engines, the topics of the documents can be expected to vary according to the query terms. In the example above, the major topic is "dispatch of Self-Defense Force for reconstruction of Iraq" for both of the reference terms</Paragraph>
      <Paragraph position="8"> , while the major topics for the other Japanese translation candidates are "issues on Japanese Diet" for ":r(dissolution of the House of Representatives)" and "issues on reconstruction of Iraq, not only in Japan, but all over the world" for "(assistance for reconstruction of Iraq)". The topics of those incorrect Japanese translation candidates are different, so their contextual vector similarities against the target English term t_E are relatively low compared with that of the reference Japanese translation</Paragraph>
      <Paragraph position="10"> . Consequently, the reference Japanese translation t_J is re-ranked higher than in the ranking based on news articles.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="910" end_page="910" type="metho">
    <SectionTitle>
5 Related Works
</SectionTitle>
    <Paragraph position="0"> In large-scale experimental evaluation of bilingual term correspondence estimation from comparable corpora, it is difficult to estimate bilingual term correspondences for every possible pair of terms because of the computational complexity. Previous work on bilingual term correspondence estimation from comparable corpora controlled the experimental evaluation in various ways in order to reduce this computational complexity. For example, Rapp (1999) filtered out bilingual term pairs with low monolingual frequencies (those below 100 occurrences), while Fung and Yee (1998) restricted candidate bilingual term pairs to pairs of the most frequent 118 unknown words. Cao and Li (2002) restricted candidate bilingual compound term pairs by consulting a seed bilingual lexicon and requiring their constituent words to be translations of each other across languages. On the other hand, in the framework of bilingual term correspondence estimation of this paper, the computational complexity of enumerating translation candidates can be easily avoided with the help of cross-language retrieval of relevant news texts.</Paragraph>
    <Paragraph position="1"> Furthermore, unlike Cao and Li (2002), bilingual term correspondences for compound terms are not restricted to compositional translation.</Paragraph>
  </Section>
</Paper>