<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0319">
  <Title>An LSA Implementation Against Parallel Texts in French and English</Title>
  <Section position="3" start_page="1" end_page="1" type="intro">
    <SectionTitle>
2. Introduction
</SectionTitle>
    <Paragraph position="0"> LSA is an analytical methodology that uses mathematical procedures and vector space modeling techniques to generate an abstract, numerical representation of the relationships among words and documents in a collection of texts (the corpus). At present, the collection of texts used consists of 30 English and 30 French documents, all perfectly mated. Cognates were not distinguished between languages, e.g., revolution is counted as both a French and an English term. In this analysis, the methodology is used to identify the symmetry that exists among the pattern of relationships and associations in the parallel texts. Where texts are perfectly aligned, it is expected that for every occurrence of a word in one language, an exact correspondent exists in the other language.</Paragraph>
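    The vector space representation described above starts from a term-by-document count matrix. The following is a minimal sketch of that starting point only, assuming a hypothetical four-document toy corpus (not the 30 + 30 documents used in the paper) and a single shared vocabulary, so that a cognate such as "revolution" gets one row for both languages. The names `docs`, `term_document_matrix` and `cosine` are illustrative. LSA would go on to apply a truncated singular value decomposition to this matrix; that step is omitted here.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two French/English document pairs (not the paper's data).
docs = {
    "F1": "je pense que la revolution commence",
    "E1": "i think the revolution begins",
    "F2": "je sais que je pars",
    "E2": "i know i leave",
}


def term_document_matrix(docs):
    """Return (terms, doc_ids, rows); rows[t][j] is the count of term t in doc_ids[j]."""
    doc_ids = sorted(docs)
    counts = {d: Counter(docs[d].lower().split()) for d in doc_ids}
    # One shared vocabulary: cognates are not split by language.
    terms = sorted(set().union(*counts.values()))
    rows = {t: [counts[d][t] for d in doc_ids] for t in terms}
    return terms, doc_ids, rows


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


terms, doc_ids, rows = term_document_matrix(docs)
print(doc_ids)             # column order of every row vector
print(rows["revolution"])  # the shared cognate row spans both languages
print(cosine(rows["je"], rows["i"]))
```

    In this toy matrix "je" and "i" never occur in the same document, so their raw row vectors are orthogonal, while the shared "revolution" row has nonzero counts in both a French and an English document.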
    <Paragraph position="1"> However, the analysis shows that even in a perfectly aligned corpus, the word distributions between the two languages deviate and a one-to-one association does not exist.</Paragraph>
    <Paragraph position="2"> An example of how word symmetry breaks down in parallel texts can be seen in two &quot;sets&quot; of parallel documents, F1-E1 and F2-E2. Consider the cross-language correspondence between the French term &quot;je&quot; and the English term &quot;I&quot; in these paired documents. In the first pair, &quot;je&quot; occurs 42 times and &quot;I&quot; occurs only 37 times. In the second pair, &quot;je&quot; occurs 59 times in the French document and &quot;I&quot; occurs 62 times in the English document. Such differences in word usage patterns between corresponding terms are very common and create difficulties for the MT or TA tasks.</Paragraph>
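    The asymmetry in this example can be made explicit with a little arithmetic. The sketch below uses only the four counts quoted in the text and assumes that the figure 62 is the number of occurrences of "I" in the English document E2. The names `counts` and `asymmetry` are illustrative.

```python
# Counts quoted in the text; 62 is read here as the count of "I" in E2 (an assumption).
counts = {
    "F1-E1": {"je": 42, "I": 37},
    "F2-E2": {"je": 59, "I": 62},
}


def asymmetry(pair_counts):
    """Return (difference, ratio) of French "je" to English "I" occurrences."""
    je, i = pair_counts["je"], pair_counts["I"]
    return je - i, je / i


report = {pair: asymmetry(c) for pair, c in counts.items()}
for pair, (diff, ratio) in report.items():
    print(f"{pair}: je - I = {diff:+d}, je / I = {ratio:.3f}")
# F1-E1: je - I = +5, je / I = 1.135
# F2-E2: je - I = -3, je / I = 0.952
```

    The sign of the difference flips between the two pairs, so no single fixed ratio maps occurrences of "je" onto occurrences of "I" across the corpus.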
    <Paragraph position="3"> Because of the way LSA represents word-usage associations and patterns among documents and terms, it may have much to offer in understanding the difficulty levels of these tasks. This analysis shows that, in spite of usage differences resulting in non-symmetrical cross-language word distributions between the corresponding terms of any given language pair, the LSA methodology is capable of identifying the appropriate usage pattern for each of the terms, within its own language. This paper presents a first look at the alignment patterns found in a parallel corpus and how LSA may offer some insights into the MT and TA tasks.</Paragraph>
  </Section>
</Paper>