<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1015">
  <Title>You'll Take the High Road and I'll Take the Low Road: Using a Third Language to Improve Bilingual Word Alignment</Title>
  <Section position="3" start_page="97" end_page="98" type="metho">
    <SectionTitle>
3 Improving word alignment by combining knowledge sources
</SectionTitle>
    <Paragraph position="0"> The project in which the research reported here has been carried out, the ETAP project (see section 8, below), is a parallel translation corpus project. Its aim is to create an annotated multilingual translation corpus, where "annotated" is understood as part-of-speech (POS) tagged and aligned, which will be used as the basis for the development of methods and tools for the automatic extraction of translation equivalents.</Paragraph>
    <Paragraph position="1"> Lately, we have been concentrating on finding good ways to improve word alignment. The word alignment system we currently use (which was developed in a sister project in our department, the PLUG project; see Sågvall Hein (to appear)) works iteratively with many kinds of information sources, and it seems that this is a good way to proceed.</Paragraph>
    <Paragraph position="2"> Distributional parallelism, cooccurrence, string similarity (both between and within languages), and part of speech are some of the information sources used, and also (heuristically based) stemming to increase type frequencies for the distributional measures (see, e.g. Tiedemann (to appear a), Tiedemann (to appear b); Melamed (1995), Melamed (1998)). In our work in the ETAP project we are looking for additional such information sources, and so far we have concentrated our efforts on exploring linguistically rich information, such as word similarity (Borin, 1998) and the combination of word alignment and POS tagging (Borin, to appear a). [Footnote fragment: language-independent case here. For any particular language pair, language-specific linguistic (and possibly other) information can be used to improve both sentence and word alignment, although the former will probably still stay ahead of the latter in terms of performance.]</Paragraph>
    <Paragraph position="3"> There must certainly exist other sources of information, in addition to those mentioned above, that can be used to improve word alignment. This paper discusses one particular such source, namely the use of a third language in the alignment process. Apart from an earlier presentation by the present author (Borin, to appear b), I have not seen any mention in the literature of the possibility of using a third language in this way for improving word alignment. Simard (1999) describes how the use of a third language can be brought to bear upon the simpler problem of sentence alignment, but he does not consider the harder problem of word alignment. Perhaps it has not been thought of for the simple reason that it is possible only with multilingual parallel corpora, and, for obvious reasons, not with bilingual corpora, which have been the kind of parallel corpus that has received most attention from researchers in the field.</Paragraph>
  </Section>
  <Section position="4" start_page="98" end_page="98" type="metho">
    <SectionTitle>
4 Pivot alignment
</SectionTitle>
    <Paragraph position="0"> Since the third language acts, as it were, as a pivot for the alignment of the two other languages, we refer to the method as pivot alignment. It works as follows with three languages, e.g. Swedish (SE), Polish (PL) and Serbian-Bosnian-Croatian (SBC), where the aim is to align Swedish with the other two languages on the word level.</Paragraph>
    <Paragraph position="2"> Perform the pairwise alignments SE→PL, SE→SBC, PL→SBC, and SBC→PL. Check whether there exist aligned words on the indirect 'alignment path' [4] SE→SBC→PL which are not on the direct path SE→PL. If there are, add them to the SE→PL alignments.</Paragraph>
    <Paragraph position="3"> Do the same for the indirect path SE→PL→SBC and the direct path SE→SBC. In order for this procedure to work, we must believe that (1) there will be differences in the SE→PL and SE→SBC alignments, and (2) these differences will 'survive' the PL→SBC and SBC→PL alignments. [5]
Hypothesis (1) seems plausible, since the word alignment system used (Tiedemann (to appear a), Tiedemann (to appear b)) actually already utilizes several kinds of information to align the words in the two texts. In particular, it uses distributional information, cooccurrence statistics, iterative size reduction, 'naive' stemming, and string similarity to select and rank word alignment candidates (but not linear order; cf. also section 3 above). Thus it is fully conceivable, e.g., that distributional information will provide one of the links and word similarity the other in a three-language path, such as SE→PL→SBC, [6] while synonymy or polysemy (i.e., distributional differences; see above) will [...]
[Footnote 4: It is this metaphor of the alignments going by different 'paths' or 'roads' to the same goal which has inspired me to borrow the first part of the title of this paper from the chorus of the song "Loch Lomond".]
[Footnote 5: Incidentally, the indirect path could be extended with more languages, e.g. Swedish→Polish→English→Spanish, etc., but we have not investigated this possibility, although we explore the possibility of using several additional languages in parallel, below.]
[Footnote 6: This is perhaps intuitively the most likely situation in this particular case, since Polish and Serbian-Bosnian-Croatian are fairly closely related Slavic languages that share many easily recognizable cognates, while both are much more remotely related to [...]]
In recent work (Borin, to appear b), we reported on a small preliminary experiment to test the feasibility of the method. We proceeded as follows:</Paragraph>
  </Section>
  <Section position="5" start_page="98" end_page="112" type="metho">
    <SectionTitle>
1. The ETAP IVT1 corpus was used for
</SectionTitle>
    <Paragraph position="0"> the experiment. This is a five-language parallel translation corpus of text from the Swedish newspaper for immigrants (Invandrartidningen; the English version is called News and Views). Swedish is the source language, and the other four languages are English (EN), Polish,</Paragraph>
    <Section position="1" start_page="98" end_page="112" type="sub_section">
      <SectionTitle>
Serbian-Bosnian-Croatian and Spanish
</SectionTitle>
      <Paragraph position="0"> alignment directions: SE→PL, SE→SBC, PL→SBC, SBC→PL in one group, and SE→EN, SE→ES, EN→ES, ES→EN in the other. 500 words were sampled randomly from the Swedish source text, and the standards with Swedish as the source were made manually by me from this sample.</Paragraph>
      <Paragraph position="2"> The target units of these standards were then used as the basis for the manual establishment (again by me) of the various target language alignment evaluation standards. Because of null links, misaligned or differently aligned sentences, etc., the size of the evaluation standards varied from 366 to 500 words. In addition to the already word-aligned SE→{EN, ES, PL, SBC}, we aligned the other language pairs necessary for the experiment. The evaluation function in the alignment system was used to calculate recall and precision for each word alignment. In addition to this, we manually extracted the additional links, if any, that would be found on the indirect path through the third language.</Paragraph>
      <Paragraph position="3"> The null links mentioned in (2) above were largely due to the sampling procedure choosing many function words, which often (also in this case) are troublesome in the context of finding good translation equivalents, since they may not correspond to words in the TL (see section 2 above).</Paragraph>
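      <Paragraph><![CDATA[
The recall and precision computation mentioned above can be illustrated with a minimal sketch. It is not the evaluation function of the alignment system, whose details are not given here. It assumes the common definitions (precision = correct links / proposed links, recall = correct links / links in the standard), and it follows the note attached to the results table in leaving out null links and scoring partly correct links together with correct ones. The function name `evaluate`, the dictionary representation and the toy words are hypothetical.

```python
from typing import Dict, Set, Tuple


def evaluate(proposed: Dict[str, Set[str]],
             standard: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Return (precision, recall) of proposed links against a manual standard.

    Both arguments map a sampled source unit to its set of target words;
    an empty set in the standard marks a null link.
    """
    # Null links in the standard are left out of the evaluation.
    gold = {src: tgt for src, tgt in standard.items() if tgt}
    # Only proposals for evaluated source units are considered.
    attempted = {src: tgt for src, tgt in proposed.items() if src in gold and tgt}
    # Any overlap counts, so partly correct links are scored like correct ones.
    correct = sum(1 for src, tgt in attempted.items() if tgt.intersection(gold[src]))
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy standard with one null link ("att") and a toy system output.
standard = {"bror": {"brat"}, "vatten": {"woda"}, "hus": {"dom"}, "att": set()}
proposed = {"bror": {"brat"}, "hus": {"okno"}, "att": {"ze"}}
precision, recall = evaluate(proposed, standard)
```

In this toy case two links are evaluated and one of them is correct, against three non-null units in the standard. With recall this low, each additional correct link changes the figure noticeably, which is the situation the pivot method is meant to exploit.
]]></Paragraph>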
      <Paragraph position="4"> The results of the preliminary experiment are shown in Table 1.</Paragraph>
      <Paragraph position="5"> We see that only a few units survived the trip through two languages, but out of those that did, most contributed positively to the total result. SE→ES and SE→PL were the alignments which benefitted most from pivot alignment (through EN and SBC, respectively), while the result was insignificant for SE→SBC and perhaps even detrimental in the case of SE→EN.</Paragraph>
      <Paragraph position="7"> [Table 1 note: null links in the standard not counted; correct and partly correct links counted together.]</Paragraph>
      <Paragraph position="8"> We saw these results as suggestive, rather than conclusive. It certainly seemed that the closer genetic relatedness of the two Slavic languages worked to our advantage, but we concluded that we needed to do more experiments, both with more language combinations and with a modified sampling procedure. In particular, we wanted to get rid of the problematic function words (see above). Since the recall is fairly low to start with, even a few correct additional alignments mean a great deal for the overall performance of the word alignment system. Thus, we thought that this approach would be worth pursuing further.

6 A new experiment with pivot alignment

To confirm these results, we slightly redesigned and extended our experimental procedure, in the following way. A new sampling of the same corpus was performed, but this time we first constructed a stop word list consisting of the 50 most frequent word types in the Swedish part of the IVT1 corpus, as a language-independent way of approximating the set of function words in the language. Thus, we had a new sample, with more content words, to compare with the previous one, the hypothesis being that a larger percentage of content words would be able to contribute more links in the pivot alignment process.</Paragraph>
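      <Paragraph><![CDATA[
The modified sampling procedure can be sketched as follows. This is only an illustration under stated assumptions: tokens are taken to be a plain list of strings, word types are compared in lowercase, and a fixed random seed is used for repeatability. None of these details, nor the function names `build_stop_list` and `sample_content_tokens`, come from the experiment itself. Only the idea of a stop list made of the most frequent word types (50 in the experiment) is taken from the text.

```python
import random
from collections import Counter
from typing import List, Set


def build_stop_list(tokens: List[str], size: int = 50) -> Set[str]:
    """Approximate the function words by the `size` most frequent word types."""
    counts = Counter(token.lower() for token in tokens)
    return {word for word, _ in counts.most_common(size)}


def sample_content_tokens(tokens: List[str], stop_list: Set[str],
                          k: int, seed: int = 0) -> List[int]:
    """Randomly pick up to k token positions whose word type is not on the stop list."""
    candidates = [i for i, token in enumerate(tokens)
                  if token.lower() not in stop_list]
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, min(k, len(candidates))))


# Toy corpus: "och" and "det" are the two most frequent types.
tokens = "och det bil och det hus och katt".split()
stop_list = build_stop_list(tokens, size=2)
picked = sample_content_tokens(tokens, stop_list, k=2)
```

A frequency cut-off needs no language-specific resources, which is why it can serve as a language-independent stand-in for a list of function words. It may also remove a few frequent content words, so the sample is only an approximation of a content-word sample.
]]></Paragraph>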
      <Paragraph position="9"> We also added some new language combinations, so that we would now be able to see whether there is a difference in using Spanish as a pivot in aligning Swedish and English, as opposed to using Polish. We also investigated what the result would be of using more than one additional language in parallel. The new pivot alignment paths investigated (in addition to the ones investigated in the first experiment) are represented by the following [...] The hypothesis was that the new setup would make the possible effect of close genetic relatedness more discernible, which indeed seems to be the case (see below).</Paragraph>
      <Paragraph position="10"> The results of the new experiment are shown in Table 2. We see that
* initial (non-pivot alignment) recall has gone up quite a bit, presumably because function words have been avoided in the standard;
* initial alignment precision still remains at the same high level as before;
* all but two of the alignments added by pivot alignment are correct, i.e. recall is raised without a decrease in precision;
* different pivot languages add different alignments, i.e. there seems to be a cumulative positive effect from adding more languages;
* the degree of relatedness of the languages in a triad seems to play a role for how well pivot alignment will work for the particular triad.</Paragraph>
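      <Paragraph><![CDATA[
The pivot alignment procedure of section 4, extended to several pivot languages used in parallel, can be sketched as a composition of link sets. This is a minimal sketch, not the procedure as carried out in the experiments, where the additional links were extracted manually. It assumes that an alignment is a set of (source word, target word) pairs and that "not on the direct path" means that the pair is absent from the direct alignment. A stricter reading, adding links only for source words that have no direct link at all, is equally possible. The names `compose` and `pivot_align` and the toy words are hypothetical.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str]  # (source unit, target unit)


def compose(first: Set[Link], second: Set[Link]) -> Set[Link]:
    """Chain source->pivot links with pivot->target links."""
    targets_by_pivot: Dict[str, Set[str]] = {}
    for pivot, target in second:
        targets_by_pivot.setdefault(pivot, set()).add(target)
    return {
        (source, target)
        for source, pivot in first
        for target in targets_by_pivot.get(pivot, set())
    }


def pivot_align(direct: Set[Link],
                paths: Dict[str, Tuple[Set[Link], Set[Link]]]):
    """Add links found on indirect paths but missing from the direct alignment.

    `paths` maps a pivot language label to a pair
    (source->pivot links, pivot->target links).
    Returns the enlarged link set and the links contributed by each pivot.
    """
    added_by_pivot = {
        pivot: compose(to_pivot, from_pivot) - direct
        for pivot, (to_pivot, from_pivot) in paths.items()
    }
    combined = set(direct).union(*added_by_pivot.values())
    return combined, added_by_pivot


# Toy SE->PL alignment with two pivots, SBC and EN.
direct_se_pl = {("bror", "brat")}
paths = {
    "SBC": ({("bror", "brat"), ("vatten", "voda")},
            {("brat", "brat"), ("voda", "woda")}),
    "EN": ({("hus", "house")}, {("house", "dom")}),
}
combined, added_by_pivot = pivot_align(direct_se_pl, paths)
```

Keeping the contribution of each pivot separate makes it possible to inspect whether different pivot languages add different links, which is the cumulative effect noted in the list above. The sketch adds every indirect link unconditionally, so any error in either leg of a path is carried over into the result.
]]></Paragraph>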
    </Section>
  </Section>
</Paper>