<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1067">
  <Title>A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Comparable corpora contain texts written in different languages that, roughly speaking, &quot;talk about the same thing&quot;. In comparison to parallel corpora, i.e. corpora whose texts are mutual translations, comparable corpora have not received much attention from the research community, and very few methods have been proposed to extract bilingual lexicons from such corpora. However, except for those found in translation services or in a few international organisations, which by their nature produce parallel documentation, most existing multilingual corpora are not parallel but comparable. This concern is reflected in major evaluation conferences on cross-language information retrieval (CLIR), e.g. CLEF, which only use comparable corpora for their multilingual tracks.</Paragraph>
    <Paragraph position="1"> We adopt here a geometric view on bilingual lexicon extraction from comparable corpora, which allows one to re-interpret the methods proposed thus far and to formulate new ones inspired by latent semantic analysis (LSA), which was developed within the information retrieval (IR) community to treat synonymous and polysemous terms (Deerwester et al., 1990). We will explain in this paper the motivations behind the use of such methods for bilingual lexicon extraction from comparable corpora, and show how to apply them. Section 2 is devoted to the presentation of the standard approach, i.e. the approach adopted by most researchers so far, its geometric interpretation, and the unresolved synonymy and polysemy problems. Sections 3 to 5 then describe three new methods that aim to address the issues raised by synonymy and polysemy: in section 3 we introduce an extension of the standard approach, and show in appendix A how this approach relates to the probabilistic method proposed by Dejean et al. (2002); in section 4, we present a bilingual extension to LSA, namely canonical correlation analysis and its kernel version; lastly, in section 5, we formulate the problem in terms of probabilistic LSA and review different associated similarities. Section 6 is then devoted to a large-scale evaluation of the different methods proposed. Open issues are then discussed in section 7.</Paragraph>
  </Section>
</Paper>