<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0818"> <Title>LIHLA: Shared task system description</Title> <Section position="4" start_page="0" end_page="111" type="metho"> <SectionTitle> 2 How LIHLA works </SectionTitle> <Paragraph position="0"> As the first step, LIHLA uses alignments between single words defined in two bilingual lexicons (source-target and target-source) generated from sentence-aligned parallel texts using NATools. Given two sentence-aligned corpus files, the NATools word aligner, which is based on the Twenty-One system (Hiemstra, 1998), counts the co-occurrences of words in all aligned sentence pairs and builds a sparse matrix of word-to-word probabilities (Model A) using an iterative expectation-maximization algorithm (5 iterations by default). Finally, the elements with the highest values in the matrix are chosen to compose two probabilistic bilingual lexicons (source-target and target-source) (Simões and Almeida, 2003). For each word in the corpus, each bilingual lexicon gives the number of occurrences of that word in the corpus (its absolute frequency) and its most likely translations together with their probabilities.</Paragraph> <Paragraph position="1"> The construction of the bilingual lexicons is an independent step that precedes the alignment performed by LIHLA, and the same bilingual lexicons can be reused several times to align parallel sentences.</Paragraph> <Paragraph position="2"> Using the two bilingual lexicons generated by NATools and some language-independent heuristics, LIHLA tries to find the best alignment between source and target tokens (words, numbers, special characters, etc.) in a pair of parallel sentences. For each source token sj in source sentence S, LIHLA looks for the best token ti in the parallel target sentence T by applying these heuristics in sequence:</Paragraph> </Section> <Section position="5" start_page="111" end_page="111" type="metho"> <SectionTitle> 1. 
Exact match </SectionTitle> <Paragraph position="0"> LIHLA creates a 1 : 1 alignment between sj and ti if they are identical. This heuristic accounts for exact matches between, for instance, proper names and numbers.</Paragraph> <Paragraph position="1"> 2. Best candidate according to the bilingual lexicon. LIHLA looks up the possible translations of sj in the source-target bilingual lexicon (BS) and intersects them with the words in T. If no word in T is identical to one of the candidates in BS, LIHLA looks in T for cognates of those candidates using the longest common subsequence ratio (LCSR). The LCSR of two words is computed by dividing the length of their longest common subsequence by the length of the longer word. For example, the LCSR of the Portuguese word alinhamento and the Spanish word alineamiento is 10/12 ≈ 0.83, as their longest common subsequence is a-l-i-n-a-m-e-n-t-o. By doing this, LIHLA can deal with small changes in possible translations such as different forms of the same verb, changes in gender and/or number of nouns, adjectives, and so on.</Paragraph> <Paragraph position="2"> Then, LIHLA selects the best target candidate word ti for sj (the best candidate word according to BS among those in a position which is favorably situated in relation to sj) and looks for multiword units involving sj and ti, that is, words that occur immediately before and/or after sj (for source multiword units) or ti (for target multiword units) and are not possible translations for other words in T and S, respectively. Depending on which multiword units are found, if any, a 1 : 1, 1 : n, m : 1 or m : n alignment is established. An omission alignment for sj (1 : 0) can also be established if no target candidate word ti that satisfies this heuristic is available.</Paragraph> </Section> <Section position="6" start_page="111" end_page="111" type="metho"> <SectionTitle> 3. 
Cognates </SectionTitle> <Paragraph position="0"> If no possible translation for sj occurs both in the bilingual lexicon and in the target sentence (T), LIHLA uses the LCSR to look for cognates of sj in T and sets a 1 : 1 alignment between sj and its best cognate, or a 1 : 0 alignment if no cognate is available.</Paragraph> <Paragraph position="1"> These heuristics are applied for as long as alignments can still be produced and a maximum number of iterations has not been reached (see section 3 for the number of iterations performed in the experiments described in this paper). Furthermore, at the first iteration, all words with a frequency higher than a set threshold are ignored to avoid erroneous alignments, since all subsequent alignments are based on the previous ones.</Paragraph> <Paragraph position="2"> In its last step (which is optional and was not performed in the experiments described in this paper), LIHLA aligns the remaining unaligned source and target tokens that lie between two pairs of already aligned tokens, establishing several 1 : 1 alignments when there is the same number of source and target tokens, or just one alignment involving all of those source and target tokens when their numbers differ. The decision to create n 1 : 1 alignments instead of a single n : n alignment when there is the same number of source and target tokens is due to the fact that a 1 : 1 alignment is more likely to be found than an n : n one.</Paragraph> </Section> </Paper>
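The LCSR computation and the cognate heuristic (heuristic 3) described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not LIHLA's actual code: the function names `lcs_length`, `lcsr` and `best_cognate` are hypothetical, and so is the `min_lcsr` cutoff of 0.7, since the text does not say which threshold LIHLA applies.

```python
from typing import List, Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    longer = max(len(a), len(b))
    return lcs_length(a, b) / longer if longer else 0.0


def best_cognate(source_token: str,
                 target_tokens: List[str],
                 min_lcsr: float = 0.7) -> Optional[int]:
    """Index of the target token with the highest LCSR, or None (a 1 : 0 alignment).

    min_lcsr is a hypothetical cutoff chosen for this sketch only.
    """
    best_index, best_score = None, 0.0
    for i, target_token in enumerate(target_tokens):
        score = lcsr(source_token, target_token)
        if score >= min_lcsr and score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    # The example from the text: LCS a-l-i-n-a-m-e-n-t-o over a 12-letter word.
    print(lcs_length("alinhamento", "alineamiento"))
    print(round(lcsr("alinhamento", "alineamiento"), 2))
    print(best_cognate("alinhamento", ["el", "alineamiento", "de"]))
```

Returning `None` when no target token reaches the cutoff mirrors the 1 : 0 (omission) alignment; a real aligner would also need to skip target tokens that are already aligned.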