<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0810"> <Title>NUKTI: English-Inuktitut Word Alignment System Description</Title> <Section position="4" start_page="75" end_page="76" type="metho"> <SectionTitle> 3 NUKTI: Word and Substring Alignment </SectionTitle> <Paragraph position="0"> Martin et al. (2003) documented a study in building and using an English-Inuktitut bitext. They described a sentence alignment technique tuned to the specific properties of the Inuktitut language, as well as a technique for acquiring corresponding pairs of English tokens and Inuktitut substrings. The motivation behind their work was to populate a glossary with reliable pairs of this kind.</Paragraph> <Paragraph position="1"> We extended this line of work in order to achieve word alignment.</Paragraph> <Section position="1" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 3.1 Association Score </SectionTitle> <Paragraph position="0"> As Martin et al. (2003) pointed out, the strongly agglutinative nature of Inuktitut makes it necessary to consider subunits of Inuktitut tokens. This is reflected by the large proportion of token types and hapax words observed on the Inuktitut side of the training corpus, compared to the ratios observed on the English side (see Table 3).</Paragraph> <Paragraph position="1"> [Table 3: Proportions of token types and hapax words observed on the English and Inuktitut sides of the TRAIN corpus.]</Paragraph> <Paragraph position="2"> The main idea presented in (Martin et al., 2003) is to compute an association score between any English word seen in the training corpus and all substrings of the Inuktitut tokens seen in the same region. In our case, we computed a likelihood ratio score (Dunning, 1993) for all pairs of English tokens and Inuktitut substrings of length ranging from 3 to 10 characters. A maximum of 25,000 associations (the top-ranked ones) was kept for each English word.</Paragraph> <Paragraph position="3"> To reduce the computational load, we used a suffix tree structure and computed the association scores only for the English words belonging to the test corpus we had to align. We also filtered out Inuktitut substrings observed fewer than three times in the training corpus. Altogether, it takes about one hour on a good desktop computer to produce the association scores for one hundred English words.</Paragraph> <Paragraph position="4"> We normalize the association scores such that, for each English word $e$, we have a distribution over likely Inuktitut substrings $s$: $\sum_s p_{llr}(s|e) = 1$.</Paragraph>
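<Paragraph position="5"> A minimal sketch of this scoring step is given below. It is not the implementation used in the system: it assumes a sentence-aligned bitext held in memory, enumerates substrings directly rather than through a suffix tree, counts co-occurrence at the level of aligned sentence pairs, and uses illustrative names (association_scores, substrings, bitext). </Paragraph>
```python
import math
from collections import Counter, defaultdict

def llr(k11, k12, k21, k22):
    # Dunning's (1993) log-likelihood ratio for a 2x2 contingency table:
    # G^2 = 2 * (H(row sums) + H(column sums) - H(cells)),
    # with H(X) = -sum(x * ln(x / total)).
    def h(counts):
        total = sum(counts)
        return -sum(c * math.log(c / total) for c in counts if c > 0)
    return 2.0 * (h([k11 + k12, k21 + k22]) +
                  h([k11 + k21, k12 + k22]) -
                  h([k11, k12, k21, k22]))

def substrings(token, lo=3, hi=10):
    # All substrings of an Inuktitut token, 3 to 10 characters long.
    for i in range(len(token)):
        for j in range(i + lo, min(i + hi, len(token)) + 1):
            yield token[i:j]

def association_scores(bitext, min_freq=3, top=25000):
    # bitext: iterable of (english_tokens, inuktitut_tokens) sentence pairs.
    e_count, s_count, pair_count = Counter(), Counter(), Counter()
    n = 0
    for e_toks, i_toks in bitext:
        n += 1
        e_set = set(e_toks)
        s_set = {s for tok in i_toks for s in substrings(tok)}
        e_count.update(e_set)
        s_count.update(s_set)
        pair_count.update((e, s) for e in e_set for s in s_set)

    scores = defaultdict(dict)
    for (e, s), k11 in pair_count.items():
        if s_count[s] < min_freq:          # drop rare substrings
            continue
        k12 = e_count[e] - k11             # e occurs without s
        k21 = s_count[s] - k11             # s occurs without e
        k22 = n - k11 - k12 - k21          # neither occurs
        scores[e][s] = llr(k11, k12, k21, k22)

    # Keep the top-ranked associations per English word and renormalize
    # them into a distribution p_llr(s|e).
    p_llr = {}
    for e, assoc in scores.items():
        best = dict(sorted(assoc.items(), key=lambda kv: -kv[1])[:top])
        z = sum(best.values()) or 1.0
        p_llr[e] = {s: v / z for s, v in best.items()}
    return p_llr
```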
</Section> <Section position="2" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 3.2 Word Alignment Strategy </SectionTitle> <Paragraph position="0"> Our approach for aligning an Inuktitut sentence of $K$ tokens $I_1^K$ with an English sentence of $N$ tokens $E_1^N$ (where $K \le N$; indeed, in the test corpus the number of Inuktitut words is always less than or equal to the number of English tokens for any sentence pair) consists of finding $K-1$ cutting points $c_{k \in [1,K-1]}$ (with $c_k \in [1,N-1]$) on the English side. A frontier $c_k$ delimits the adjacent English words $E_{c_{k-1}+1}^{c_k}$ that are the translation of the single Inuktitut word $I_k$.</Paragraph> <Paragraph position="1"> With the conventions that $c_0 = 0$, $c_K = N$ and $c_{k-1} < c_k$, we can formulate our alignment problem as seeking the best segmentation $$\hat{c}_1^{K-1} = \operatorname*{argmax}_{c_1^{K-1}} \sum_{k=1}^{K} \big[ \alpha_1 \log p(I_k \mid E_{c_{k-1}+1}^{c_k}) + \alpha_2 \log p(d_k) \big] \qquad (1)$$ where $d_k = c_k - c_{k-1}$ is the number of English words associated with $I_k$; $p(d_k)$ is the prior probability that $d_k$ English words are aligned to a single Inuktitut word, which we computed directly from Table 1; and $\alpha_1$ and $\alpha_2$ are two weighting coefficients.</Paragraph> <Paragraph position="2"> We tried two approximations to compute $p(I_k \mid E_{c_{k-1}+1}^{c_k})$; the second one led to better results.</Paragraph> <Paragraph position="3"> We considered several ways of computing the probability that an Inuktitut token $I$ is the translation of an English one $E$, the best one we found being $$p(I \mid E) = \sum_{s} \big[ \lambda \, p_{ibm2}(s \mid E) + (1 - \lambda) \, p_{llr}(s \mid E) \big]$$ where the summation is carried over all substrings $s$ of $I$ of 3 characters or more; $p_{llr}(s \mid E)$ is the normalized log-likelihood ratio score described above, and $p_{ibm2}(s \mid E)$ is the probability obtained from an IBM Model 2 trained after the Inuktitut side of the training corpus had been segmented with a recursive procedure optimizing a frequency-based criterion; $\lambda$ is a weighting coefficient.</Paragraph> <Paragraph position="4"> We also tried directly embedding a model trained on whole (unsegmented) Inuktitut tokens, but noticed a degradation in performance (line 2 of Table 4).</Paragraph> </Section> <Section position="3" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 3.3 A Greedy Search Strategy </SectionTitle> <Paragraph position="0"> Due to its combinatorial nature, the maximization of equation 1 was barely tractable. We therefore adopted a greedy strategy to reduce the search space. We first computed a split of the English sentence into $K$ adjacent regions $c_1^K$ by virtually drawing the diagonal line we would observe if a character in one language produced a constant number of characters in the other one.</Paragraph> <Paragraph position="1"> An initial word alignment was then found by simply tracking this diagonal at the word granularity level.</Paragraph> <Paragraph position="2"> Having this split in hand (line 1 of Table 4), we move each cutting point around its initial value, starting from the leftmost cutting point and going rightward. Once a locally optimal cutting point has been found (that is, one maximizing the score of equation 1), we proceed to the next one directly to its right.</Paragraph>
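<Paragraph position="3"> The sketch below illustrates this greedy adjustment under simplifying assumptions not stated above: p_trans is a caller-supplied stand-in for the interpolated $p(I_k \mid E_{c_{k-1}+1}^{c_k})$ of Section 3.2, the length prior $p(d_k)$ is dropped (consistent with $\alpha_2$ being close to zero, cf. Section 3.4), and each cutting point is explored only within a small window around its current value. </Paragraph>
```python
import math

def initial_split(n_english, n_inuktitut):
    # Diagonal initialization: place the K-1 cutting points evenly along
    # the English sentence (a word-level proxy for the character diagonal).
    k, n = n_inuktitut, n_english
    cuts = []
    for i in range(1, k):
        c = round(i * n / k)
        c = max(c, cuts[-1] + 1 if cuts else 1)  # keep cuts strictly increasing
        cuts.append(min(c, n - (k - i)))         # leave room for remaining words
    return cuts

def segment_score(p_trans, inuktitut, english, cuts):
    # Score of a full segmentation: sum over k of log p(I_k | English span k).
    bounds = [0] + cuts + [len(english)]
    return sum(math.log(max(p_trans(tok, english[bounds[k]:bounds[k + 1]]), 1e-12))
               for k, tok in enumerate(inuktitut))

def greedy_align(p_trans, inuktitut, english, window=3):
    # Move each cutting point around its initial value, leftmost first,
    # keeping the locally best position before moving to the next one.
    cuts = initial_split(len(english), len(inuktitut))
    for k in range(len(cuts)):
        lo = cuts[k - 1] + 1 if k > 0 else 1
        hi = cuts[k + 1] - 1 if k + 1 < len(cuts) else len(english) - 1
        candidates = [c for c in range(lo, hi + 1) if abs(c - cuts[k]) <= window]
        cuts[k] = max(candidates or [cuts[k]],
                      key=lambda c: segment_score(p_trans, inuktitut, english,
                                                  cuts[:k] + [c] + cuts[k + 1:]))
    return cuts
```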
</Section> </Section> <Section position="5" start_page="76" end_page="76" type="metho"> <SectionTitle> 3.4 Results </SectionTitle> <Paragraph position="0"> We report in Table 4 the performance of the different variants we tried, as measured on the development set. We used these figures to select the best configuration, which we eventually submitted.</Paragraph> <Paragraph position="1"> [Table 4: Precision, recall, F-measure and AER of the different word alignment techniques, measured on the DEV corpus.] It is interesting to note that the starting point of the greedy search (line 1) does better than our first approach. However, moving away from this initial split clearly improves performance (line 3). Among the greedy variants we tested, we noticed that putting much of the weight $\lambda$ on the IBM Model 2 yielded the best results. We also noticed that $p(d_k)$ in equation 1 did not help ($\alpha_2$ was close to zero). A character-based model might have been more appropriate in this case.</Paragraph> </Section> <Section position="6" start_page="76" end_page="77" type="metho"> <SectionTitle> 4 Combination of JAPA and NUKTI </SectionTitle> <Paragraph position="0"> One important weakness of our first approach lies in the Cartesian product we generate whenever JAPA produces an n-m (n, m > 1) alignment. We therefore tried a third approach: we apply NUKTI to any n-m alignment JAPA produces, as if this initial alignment were in fact two (small) sentences to align, n and m words long respectively. We can thus avoid the Cartesian product and select word alignments in a more discerning way. As can be seen in Table 5, this combination improved over JAPA alone, while remaining worse than NUKTI alone.</Paragraph> </Section> </Paper>