<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1411"> <Title>Towards a Simple and Accurate Statistical Approach to Learning Translation Relationships among Words</Title> <Section position="4" start_page="1" end_page="3" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> • Compute a measure of association between words in language L1 and words in language L2 in aligned sentences of a parallel bilingual corpus. • Rank order pairs consisting of a word from L1 and a word from L2 according to the measure of association. The important work of Brown et al. (1993) is not directly comparable, since their globally-optimized generative probabilistic model of translation never has to make a firm commitment as to what can or cannot be a translation pair. They assign some nonzero probability to every possible translation pair.</Paragraph> <Paragraph position="1"> • Choose a threshold, and add to the translation lexicon all pairs of words whose degree of association is above the threshold.</Paragraph> <Paragraph position="2"> As Melamed (1996, 2000) later pointed out, however, this technique is hampered by the existence of indirect associations between words that are not mutual translations. For example, in our parallel French-English corpus (consisting primarily of translated computer software manuals), two of the most strongly associated word lemma translation pairs are fichier/file and système/system. However, because the monolingual collocations système de fichiers, fichiers système, file system, and system files are so common, the spurious translation pairs fichier/system and système/file also receive rather high association scores--higher in fact than such true translation pairs as confiance/trust, parallélisme/parallelism, and film/movie.</Paragraph> <Paragraph position="3"> Melamed's solution to this problem is not to regard highly-associated word pairs as translations in sentences in which there are even more highly-associated pairs involving one or both of the same words. Since indirect associations are normally weaker than direct ones, this usually succeeds in selecting true translation pairs over the spurious ones. For example, in parallel sentences containing fichier and système on the French side and file and system on the English side, the associations of fichier/system and système/file will be discounted, because the degrees of association for fichier/file and système/system are so much higher.</Paragraph>
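The following is a minimal Python sketch of the competing-association idea just described; it is not Melamed's actual model, and the lemma pairs and scores are invented for illustration:

    # Within one aligned sentence pair, drop a candidate pair whenever either of
    # its words takes part in a strictly stronger association in the same pair.
    # Scores stand in for association statistics such as the log-likelihood ratio.
    def filter_competing(pairs, score):
        kept = []
        for src, tgt in pairs:
            s = score.get((src, tgt), 0.0)
            rivals = [score.get((a, b), 0.0)
                      for (a, b) in pairs
                      if (a == src or b == tgt) and (a, b) != (src, tgt)]
            if all(s > r for r in rivals):
                kept.append((src, tgt))
        return kept

    # Illustrative scores: the indirect pairs lose to the stronger direct ones.
    score = {("fichier", "file"): 620.0, ("systeme", "system"): 540.0,
             ("fichier", "system"): 310.0, ("systeme", "file"): 280.0}
    print(filter_competing(list(score), score))
    # -> [('fichier', 'file'), ('systeme', 'system')]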
<Paragraph position="4"> Melamed's results using this approach extend the range of high-accuracy output to much higher coverage levels than previously reported. Our basic method is rooted in the same insight regarding competing associations for the same words, but we embody it in a simpler model that is easier to implement and, we believe, faster to run. (Melamed does not report computation time for the version of his approach without generation of compounds, but our approach omits a number of computationally very expensive steps performed in his approach.)</Paragraph> <Paragraph position="5"> As we will see below, our model yields results that seem comparable to Melamed's up to nearly 60% coverage of the lexicon.</Paragraph> <Paragraph position="6"> A second important issue regarding automatic derivation of translation relationships is the assumption implicit (or explicit) in most previous work that lexical translation relationships involve only single words. This is manifestly not the case, as is shown by the following list of translation pairs selected from our corpus:
base de données/database
mot de passe/password
sauvegarder/back up
annuler/roll back
ouvrir session/log on</Paragraph> <Paragraph position="7"> Some of the most sophisticated work on this aspect of the problem again seems to be that of Melamed (1997). Our approach in this case is quite different from Melamed's. It is more general in that it can propose compounds that are discontiguous in the training text, as roll back would be in a phrase such as roll the failed transaction back. Melamed does allow skipping over one or two function words, but our basic method is not limited at all by word adjacency. Also, our approach is again much simpler computationally than Melamed's and apparently runs orders of magnitude faster. (Melamed reports that training on 13 million words took over 800 hours in Perl on a 167-MHz UltraSPARC processor. Training our method on 6.6 million words took approximately 0.5 hours in Perl on a 1-GHz Pentium III processor. Even allowing an order of magnitude for the differences in processor speed and amount of data, there seems to be a difference between the two methods of at least two orders of magnitude in computation required. Unfortunately, Melamed evaluates accuracy in his work on translation compounds differently from his work on single-word translation pairs, so we are not able to compare our method to his in that regard.)</Paragraph> </Section> <Section position="8" start_page="3" end_page="5" type="metho"> <SectionTitle> 3 Our Basic Method </SectionTitle> <Paragraph position="0"> Our basic method for deriving translation pairs consists of the following steps: 1. Extract word lemmas from the logical forms produced by parsing the raw training data. 2. Compute association scores for individual lemmas.</Paragraph> <Paragraph position="1"> 3. Hypothesize occurrences of compounds in the training data, replacing lemmas constituting hypothesized occurrences of a compound with a single token representing the compound.</Paragraph> <Paragraph position="2"> 4. Recompute association scores for compounds and remaining individual lemmas.</Paragraph> <Paragraph position="3"> 5.
Recompute association scores, taking into account only co-occurrences such that there is no equally strong or stronger association for either item in the aligned logical-form pair.</Paragraph> <Paragraph position="4"> We describe each of these steps in detail below.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.1 Extracting word lemmas </SectionTitle> <Paragraph position="0"> In Step 1, we simply collect, for each sentence, the word lemmas identified by our MT system parser as the key content items in the logical form.</Paragraph> <Paragraph position="1"> These are predominantly morphologically analyzed word stems, omitting most function words.</Paragraph> <Paragraph position="2"> In addition, however, the parser treats certain lexical compounds as if they were single units. These include multi-word expressions placed in the lexicon because they have a specific meaning or use, plus a number of general categories including proper names, names of places, time expressions, dates, measure expressions, etc. We will refer to all of these generically as &quot;multiwords&quot;.</Paragraph> <Paragraph position="3"> The existence of multiwords simplifies learning some translation relationships, but makes others more complicated. For example, we do not, in fact, have to learn base de données as a compound translation for database, because it is extracted from the French logical forms already identified as a single unit. Thus we only need to learn the base de données/database correspondence as a simple one-to-one mapping. On the other hand, the disque dur/hard disk correspondence is learned as a two-to-one relationship independently of disque/disk and dur/hard (which are also learned), because hard disk appears as a multiword in our English logical forms, but disque and dur always appear as separate tokens in our French logical forms.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 Computing association scores </SectionTitle> <Paragraph position="0"> For Step 2, we compute the degree of association between each lemma of L1 and each lemma of L2, in terms of the frequency with which each lemma occurs in its part of the training corpus, compared to the frequency with which the two lemmas co-occur in aligned sentences of the training corpus. For this purpose, we ignore multiple occurrences of a lemma in a single sentence. As a measure of association, we use the log-likelihood-ratio statistic recommended by Dunning (1993), which is the same statistic used by Melamed to initialize his models. This statistic gives a measure of the likelihood that two samples are not generated by the same probability distribution. We use it to compare the overall distribution of one lemma in the corpus with its distribution in the sentences aligned with sentences containing the other lemma. Since these distributions would be the same if the two lemmas were independent, a measure of the likelihood that these distributions are different is, therefore, a measure of the likelihood that an observed positive association between the two lemmas is not accidental.</Paragraph> <Paragraph position="1"> Since this process generates association scores for a huge number of lemma pairs for a large training corpus, we prune the set to restrict our consideration to those pairs having at least some chance of being considered as translation pairs. We heuristically set this threshold to be the degree of association of a pair of lemmas that have one co-occurrence, plus one other occurrence each.</Paragraph> </Section>
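As a concrete illustration, here is a minimal Python sketch of a log-likelihood-ratio association score computed from sentence-level co-occurrence counts; the counts, the function name, and the particular G-squared formulation of Dunning's statistic are illustrative assumptions rather than the system's actual code:

    import math

    def llr(c12, c1, c2, n):
        # c12: aligned sentence pairs containing both lemmas; c1, c2: pairs
        # containing each lemma on its own side; n: total aligned sentence pairs.
        # Multiple occurrences of a lemma within one sentence are counted once.
        table = [[c12, c1 - c12],
                 [c2 - c12, n - c1 - c2 + c12]]
        score = 0.0
        for i in range(2):
            for j in range(2):
                observed = table[i][j]
                if observed == 0:
                    continue
                expected = (table[i][0] + table[i][1]) * (table[0][j] + table[1][j]) / float(n)
                score += observed * math.log(observed / expected)
        return 2.0 * score

    # Lemmas that co-occur far more often than chance predicts score highly.
    print(llr(c12=80, c1=100, c2=90, n=10000))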
<Section position="3" start_page="3" end_page="5" type="sub_section"> <SectionTitle> 3.3 Hypothesizing compounds and recomputing association scores </SectionTitle> <Paragraph position="0"> If our data were very clean and all translations were one-to-one, we would expect that in most aligned sentence pairs, each word or lemma would be most strongly associated with its translation in that sentence pair, since, as Melamed has argued, direct associations should be stronger than indirect ones. Since translation is symmetric, we would expect that if one lemma is the lemma most strongly associated with a second lemma in a sentence pair, then the second lemma would likewise be the lemma most strongly associated with the first. Violations of this pattern are suggestive of translation relationships involving compounds. Thus, if we have a pair of aligned sentences in which password occurs in the English sentence and mot de passe occurs on the French side, we should not be surprised if mot and passe are both most strongly associated with password within this sentence pair. Password, however, cannot be most strongly associated with both mot and passe.</Paragraph> <Paragraph position="1"> Our method of carrying out Step 3 is based on finding violations of this condition wherever they occur in a pair of aligned sentences. For each lemma, we add a link to the uniquely most strongly associated lemma of the other language. (Because the data becomes quite noisy if a lemma has no lemmas in the other language that are very strongly associated with it, we place a heuristically chosen threshold on the minimum degree of association that is allowed to produce a link.)</Paragraph> <Paragraph position="2"> Consider the maximal, connected subgraphs of the resulting graph. If all translations within the sentence pair are one-to-one, each of these subgraphs should contain exactly two lemmas, one from L1 and one from L2. If a subgraph contains more than two lemmas, it must contain more than one lemma of one of the languages, and we consider all the lemmas of that language in the subgraph to form a compound. In the case of mot, passe, and password, as described above, there would be a connected subgraph containing these three lemmas; so the two French lemmas, mot and passe, would be considered to form a compound in the French sentence under consideration.</Paragraph> <Paragraph position="3"> The output of this step of our process is a transformed set of lemmas for each sentence in the corpus. For each sentence and each subset of the lemmas in that sentence that has been hypothesized to form a compound in the sentence, we replace those lemmas with a token representing them as a single unit. Note that this process works on a sentence-pair by sentence-pair basis, so that a compound hypothesized for one sentence pair may not be hypothesized for a different sentence pair, if the pattern of strongest associations for the two sentence pairs differs. Order of occurrence is not considered in forming these compounds, and the same token is always used to represent the same set of lemmas. (The surface order is not needed by the alignment procedure intended to make use of the translation relationships we discover.)</Paragraph> <Paragraph position="4"> Once the sets of lemmas for the training corpus have been reformulated in terms of the hypothesized compounds, Step 4 consists simply in repeating Step 2 on the reformulated training data.</Paragraph> </Section>
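The linking-and-subgraph step can be pictured with the following minimal Python sketch; the function, threshold, scores, and lemma lists are illustrative assumptions, not the actual implementation:

    from collections import defaultdict

    def hypothesize_compounds(fr_lemmas, en_lemmas, score, min_score=10.0):
        # Link each lemma to its uniquely most strongly associated lemma of the
        # other language (no link below min_score or on ties), then treat any
        # connected component holding several same-language lemmas as a compound.
        def best(item, candidates, key):
            ranked = sorted(((score.get(key(item, c), 0.0), c) for c in candidates),
                            reverse=True)
            if ranked and ranked[0][0] >= min_score and \
               (len(ranked) == 1 or ranked[0][0] > ranked[1][0]):
                return ranked[0][1]
            return None

        graph = defaultdict(set)
        for f in fr_lemmas:
            e = best(f, en_lemmas, lambda fr, en: (fr, en))
            if e:
                graph[("fr", f)].add(("en", e)); graph[("en", e)].add(("fr", f))
        for e in en_lemmas:
            f = best(e, fr_lemmas, lambda en, fr: (fr, en))
            if f:
                graph[("en", e)].add(("fr", f)); graph[("fr", f)].add(("en", e))

        compounds, seen = [], set()
        for start in list(graph):
            if start in seen:
                continue
            stack, component = [start], set()
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node); component.add(node)
                stack.extend(graph[node])
            for lang in ("fr", "en"):
                members = [w for l, w in component if l == lang]
                if len(members) > 1:
                    compounds.append((lang, frozenset(members)))
        return compounds

    # mot and passe both link most strongly to password -> one French compound.
    score = {("mot", "password"): 150.0, ("passe", "password"): 140.0}
    print(hypothesize_compounds(["mot", "de", "passe"], ["password"], score))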
<Section position="4" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 3.4 Recomputing association scores, taking into account only strongest associations </SectionTitle> <Paragraph position="0"> If Steps 1-4 worked perfectly, we would have correctly identified all the compounds needed for translation and reformulated the training data to treat each such compound as a single item. At this point, we should be able to treat the training data as if all translations are one-to-one. We therefore choose our final set of ranked translation pairs on the assumption that true translation pairs will always be mutually most strongly associated in a given aligned sentence pair.</Paragraph> <Paragraph position="1"> Step 5 thus proceeds exactly as Step 4, except that we count a co-occurrence only when there is no equally strong or stronger association for either item, among the lemmas (or compound lemmas) present in a given aligned sentence pair. (The associations computed by the previous step are used to make these decisions.) This final set of associations is then sorted in decreasing order of strength of association.</Paragraph> </Section> </Section>
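A minimal Python sketch of this final recount, with invented names and under the assumption that the association scores from the previous step are available as a dictionary:

    from collections import Counter

    def count_mutual_best(aligned_pairs, score):
        # aligned_pairs: iterable of (fr_lemmas, en_lemmas), one per aligned
        # sentence pair; score: associations from the previous step, keyed (fr, en).
        # A co-occurrence is counted only when neither item has an equally strong
        # or stronger association with anything else in the same sentence pair.
        counts = Counter()
        for fr_lemmas, en_lemmas in aligned_pairs:
            for f in fr_lemmas:
                for e in en_lemmas:
                    s = score.get((f, e), 0.0)
                    rivals = [score.get((f, e2), 0.0) for e2 in en_lemmas if e2 != e]
                    rivals += [score.get((f2, e), 0.0) for f2 in fr_lemmas if f2 != f]
                    if s > 0.0 and all(s > r for r in rivals):
                        counts[(f, e)] += 1
        return counts

The counts collected this way could then be rescored with the same log-likelihood-ratio statistic and sorted in decreasing order of association strength.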
<Section position="9" start_page="5" end_page="6" type="metho"> <SectionTitle> 4 Identifying Translations of &quot;Captoids&quot; </SectionTitle> <Paragraph position="0"> In addition to using these techniques to provide translation relationships to the logical-form alignment process, we have applied related methods to a problem that arises in parsing the raw input text.</Paragraph> <Paragraph position="1"> Often in text, particularly the kind of technical text we are experimenting with, phrases are used, not in their usual way, but as the name of something in the domain. Consider: Click to remove the View As Web Page check mark. In this sentence, View As Web Page has the syntactic form of a nonfinite verb phrase, but it is used as if it were a proper name. If the parser does not recognize this special use, it is virtually impossible to parse the sentence correctly.</Paragraph> <Paragraph position="2"> Expressions of this type are fairly easily handled by our English parser, however, because capitalization conventions in English make them easy to recognize. The tokenizer used to prepare sentences for parsing, under certain conditions, hypothesizes that sequences of capitalized words such as View As Web Page should be treated as lexicalized multi-word expressions, as discussed in Section 3.1. We refer to this subclass of multiwords as &quot;captoids&quot;. The capitalization conventions of French (or Spanish) make it harder to recognize such expressions, however, because typically only the first word of such an expression is capitalized.</Paragraph> <Paragraph position="3"> We have adapted the methods described in Section 3 to address this problem by finding sequences of French words that are highly associated with English captoids. The sequences of French words that we find are then added to the French lexicon as multiwords.</Paragraph> <Paragraph position="4"> The procedure for identifying translations of captoids is as follows: 1. Tokenize the training data to separate words from punctuation and identify multiwords wherever possible.</Paragraph> <Paragraph position="5"> 2. Compute association scores for items in the tokenized data.</Paragraph> <Paragraph position="6"> 3. Hypothesize sequences of French words as compounds corresponding to English multiwords, replacing hypothesized occurrences of a compound in the training data with a single token representing the compound.</Paragraph> <Paragraph position="7"> 4. Recompute association scores for pairs of items where either the English item or the French item is a multiword beginning with a capital letter.</Paragraph> <Paragraph position="8"> 5. Filter the resulting list to include only translation pairs such that there is no equally strong or stronger association for either item in the training data.</Paragraph> <Paragraph position="9"> There are a number of key differences from our previous procedure. First, since this process is meant to provide input to parsing, it works on tokenized word sequences rather than lemmas extracted from logical forms. Because many of the English multiwords are so rare that associations for the entire multiword are rather weak, in Step 2 we count occurrences of the constituent words contained in multiwords as well as occurrences of the multiwords themselves. Thus an occurrence of View As Web Page would also count as an occurrence of view, as, web, and page.</Paragraph> <Paragraph position="10"> The method of hypothesizing compounds in Step 3 adds a number of special features to improve accuracy and coverage. Since we know we are trying to find French translations for English captoids, we look for compounds only in the French data. If any of the association scores between a French word and the constituent words of an English multiword are higher than the association score between the French word and the entire multiword, we use the highest such score to represent the degree of association between the French word and the English multiword. (In identifying captoid translations, we ignore case differences for computing and using association scores.) We reserve, for consideration as the basis of compounds, only sets of French words that are most strongly associated in a particular aligned sentence pair with an English multiword that starts with a capitalized word.</Paragraph> <Paragraph position="11"> Finally, we scan the French sentence of the aligned pair from left to right, looking for a capitalized word that is a member of one of the compound-defining sets for the pair, as sketched below. When we find such a word, we begin constructing a French multiword. We continue scanning to the right to find other members of the compound-defining set, allowing up to two consecutive words not in the set, provided that another word in the set immediately follows, in order to account for French function words that might not have high associations with anything in the English multiword. We stop adding to the French multiword once we have found all the French words in the compound-defining set, or if we encounter a punctuation symbol, or if we encounter three or more consecutive words not in the set. If either of the latter two conditions occurs before exhausting the compound-defining set, we assume that the remaining members of the set represent spurious associations and we leave them out of the French multiword.</Paragraph> <Paragraph position="12"> The restriction in Step 4 to consider only associations in which one of the items is a multiword beginning with a capital letter is simply for efficiency, since from this point onward no other associations are of interest.</Paragraph>
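A minimal Python sketch of this left-to-right scan, under illustrative assumptions: the tokenized French sentence and the compound-defining set are taken as given from the earlier steps, and the example sentence and set are invented:

    import string

    def build_french_multiword(tokens, compound_set):
        # tokens: the French sentence as a list of tokens; compound_set: the French
        # words most strongly associated with a capitalized English multiword.
        remaining = set(compound_set)
        multiword, gap, started = [], 0, False
        for tok in tokens:
            if not started:
                if tok in remaining and tok[:1].isupper():
                    started = True
                    multiword.append(tok)
                    remaining.discard(tok)
                continue
            if all(ch in string.punctuation for ch in tok):
                break                          # punctuation ends the multiword
            if tok in remaining:
                multiword.append(tok)
                remaining.discard(tok)
                gap = 0
            else:
                gap += 1
                if gap >= 3:                   # three straight non-set words end it
                    break
                multiword.append(tok)          # tentatively keep likely function words
            if not remaining:                  # all set members found
                break
        while multiword and multiword[-1] not in compound_set:
            multiword.pop()                    # drop trailing tentative words
        return multiword

    tokens = "Cliquez pour supprimer la coche Afficher comme page Web .".split()
    print(build_french_multiword(tokens, {"Afficher", "page", "Web"}))
    # -> ['Afficher', 'comme', 'page', 'Web']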
<Paragraph position="13"> The final filter applied in Step 5 is more stringent than in our basic method. The reasoning is that, while a single word may have more than one translation in different contexts, the sort of complex multiword represented by a captoid would normally be expected to receive the same translation in all contexts. Therefore we accept only translations involving captoids that are mutually uniquely most strongly associated across the entire corpus. To focus on the cases we are most interested in and to increase accuracy, we require each translation pair generated to satisfy the following additional conditions: • The French item must be one of the multiwords we constructed.</Paragraph> <Paragraph position="14"> • The English item must be a multiword all of whose constituent words are capitalized.</Paragraph> <Paragraph position="15"> • The French item must contain at least as many words as the English item.</Paragraph> <Paragraph position="16"> The last condition corrects some errors made by allowing highly associated French words to be left out of the hypothesized compounds.</Paragraph> </Section> </Paper>