<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1107"> <Title>Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification</Title> <Section position="4" start_page="44" end_page="45" type="metho"> <SectionTitle> 3 Learning algorithms </SectionTitle> <Paragraph position="0"> In this section, we briefly describe several machine learning algorithms that automatically derive weights or probabilities for different edit operations.</Paragraph> <Section position="1" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 3.1 Stochastic transducer </SectionTitle> <Paragraph position="0"> Ristad and Yianilos (1998) attempt to model edit distance more robustly by using Expectation Maximization to learn probabilities for each of the possible edit operations. These probabilities are then used to create a stochastic transducer, which scores a pair of words based on either the most probable sequence of operations that could produce the two words (Viterbi scoring), or the sum of the scores of all possible paths that could have produced the two words (stochastic scoring). The score of an individual path here is simply the product of the probabilities of the edit operations in the path. The algorithm was evaluated on the task of matching surface pronunciations in the Switchboard data to their canonical pronunciations in a lexicon, yielding a significant improvement in accuracy over Levenshtein distance.</Paragraph> </Section> <Section position="2" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 3.2 Levenshtein with learned weights </SectionTitle> <Paragraph position="0"> Mann and Yarowsky (2001) applied the stochastic transducer of Ristad and Yianilos (1998) for inducing translation lexicons between two languages, but found that in some cases it offered no improvement over Levenshtein distance.
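As an illustration, the two scoring regimes of the stochastic transducer can be sketched with a standard dynamic program over edit paths. The probability tables below (p_sub, p_ins, p_del) are hypothetical stand-ins for the EM-learned parameters, not the actual model of Ristad and Yianilos.

```python
def transducer_score(s, t, p_sub, p_ins, p_del, mode="viterbi"):
    """Score a word pair under a stochastic transducer.

    mode="viterbi" keeps the single most probable edit path;
    mode="stochastic" sums the probabilities of all edit paths.
    A path's probability is the product of its operation probabilities.
    """
    combine = max if mode == "viterbi" else sum
    n, m = len(s), len(t)
    # D[i][j] = score of generating the prefixes s[:i] and t[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            paths = []
            if i > 0:  # delete s[i-1]
                paths.append(D[i - 1][j] * p_del.get(s[i - 1], 0.0))
            if j > 0:  # insert t[j-1]
                paths.append(D[i][j - 1] * p_ins.get(t[j - 1], 0.0))
            if i > 0 and j > 0:  # substitute s[i-1] -> t[j-1]
                paths.append(D[i - 1][j - 1] * p_sub.get((s[i - 1], t[j - 1]), 0.0))
            D[i][j] = combine(paths)
    return D[n][m]
```

Because Viterbi scoring keeps only the best path while stochastic scoring aggregates all of them, the stochastic score is never smaller than the Viterbi score for the same pair.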
In order to remedy this problem, they proposed to filter the probabilities learned by EM into a few discrete cost classes, which are then used in the standard edit distance algorithm. The LLW approach yielded improvement over both regular Levenshtein and the stochastic transducer.</Paragraph> </Section> <Section position="3" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 3.3 CORDI </SectionTitle> <Paragraph position="0"> CORDI (Kondrak, 2002) is a program for detecting recurrent sound correspondences in bilingual wordlists. The idea is to relate recurrent sound correspondences in wordlists to translational equivalences in bitexts. A translation model is induced between phonemes in two wordlists by combining the maximum similarity alignment with the competitive linking algorithm of Melamed (2000). Melamed's approach is based on the one-to-one assumption, which implies that every word in the bitext is aligned with at most one word on the other side of the bitext.</Paragraph> <Paragraph position="1"> In the context of bilingual wordlists, the correspondences determined under the one-to-one assumption are restricted to link single phonemes to single phonemes. Nevertheless, the method is powerful enough to determine valid correspondences in wordlists in which the fraction of cognate pairs is well below 50%.</Paragraph> <Paragraph position="2"> The discovered phoneme correspondences can be used to compute a correspondence-based similarity score between two words. Each valid correspondence is counted as a link and contributes a constant positive score (no crossing links are allowed). Each unlinked segment, with the exception of the segments beyond the rightmost link, is assigned a smaller negative score. The alignment with the highest score is found using dynamic programming (Wagner and Fischer, 1974). If more than one best alignment exists, links are assigned the weight averaged over the entire set of best alignments.
Finally, the score is normalized by dividing it by the average of the lengths of the two words.</Paragraph> </Section> <Section position="4" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 3.4 Pair HMM </SectionTitle> <Paragraph position="0"> Mackay and Kondrak (2005) propose computing similarity between pairs of words with a technique adapted from the field of bioinformatics. A Pair Hidden Markov Model differs from a standard HMM by producing two output streams in parallel, each corresponding to a word that is being aligned. The model has three states that correspond to the basic edit operations: substitution, insertion, and deletion. The parameters of the model are automatically learned from training data that consists of word pairs that are known to be similar. The model is trained using the Baum-Welch algorithm (Baum et al., 1970).</Paragraph> </Section> </Section> <Section position="5" start_page="45" end_page="46" type="metho"> <SectionTitle> 4 Dynamic Bayesian Nets </SectionTitle> <Paragraph position="0"> A Bayesian Net is a directed acyclic graph in which each of the nodes represents a random variable.</Paragraph> <Paragraph position="1"> The random variable can be either deterministic, in which case the node can only take on one value for a given configuration of its parents, or stochastic, in which case the configuration of the parents determines the probability distribution of the node. Arcs in the net represent dependency relationships.</Paragraph> <Paragraph position="2"> Filali and Bilmes (2005) proposed to use Dynamic Bayesian Nets (DBNs) for computing word similarity. A DBN is a Bayesian Net where a set of arcs and nodes are maintained for each point in time in a dynamic process.
This involves a set of prologue frames denoting the beginning of the process, chunk frames, which are repeated for the middle of the process, and epilogue frames to end the process.</Paragraph> <Paragraph position="3"> The conditional probability relationships are time-independent. DBNs can encode quite complex interdependencies between states.</Paragraph> <Paragraph position="4"> We tested four different DBN models on the task of cognate identification. In the following description of the models, Z denotes the current edit operation, which can be either a substitution, an insertion, or a deletion.</Paragraph> <Paragraph position="5"> MCI The memoriless context-independent model (Figure 1) is the most basic model, which is meant to be equivalent to the stochastic transducer of Ristad and Yianilos (1998). Its lack of memory signifies that the probability of Z taking on a given value does not depend in any way on what previous values of Z have been.</Paragraph> <Paragraph position="6"> The context-independence refers to the fact that the probability of Z taking on a certain value does not depend on the letters of the source or target word. The a and b nodes in Figure 1 represent the current position in the source and target words, respectively. The s and t nodes represent the current letter in the source and target words. The end node is a switching parent of Z and is triggered when the values of the a and b nodes move past the end of both the source and target words. The sc and tc nodes are consistency nodes which ensure that the current edit operation is consistent with the current letters in the source and target words. Consistency here means that the source side of the edit operation must either match the current source letter or be the empty symbol ε, and that the same be true for the target side.
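The consistency condition just described can be expressed as a one-line check; this is only a sketch of the constraint the sc and tc nodes enforce, with the empty string standing in for ε and the function name chosen for illustration.

```python
EPS = ""  # stand-in for the empty symbol epsilon

def consistent(op, s_letter, t_letter):
    """An edit operation (src, tgt) is consistent with the current
    position if each side either matches the corresponding current
    letter or is empty (i.e., that side is not consumed)."""
    src, tgt = op
    return src in (EPS, s_letter) and tgt in (EPS, t_letter)
```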
Finally, the send and tend nodes appear only in the last frame of the model, and are only given a positive probability if both words have already been completely processed, or if the final edit operation will conclude both words.</Paragraph> <Paragraph position="7"> The following models all use the MCI model as a basic framework, while adding new dependencies to Z.</Paragraph> <Paragraph position="8"> MEM In the memory model, the probability of the current operation being performed depends on what the previous operation was.</Paragraph> <Paragraph position="9"> CON In the context-dependent model, the probability that Z takes on certain values is dependent on letters in the source word or target word.</Paragraph> <Paragraph position="10"> The model that we test in Section 5 takes into account the context of two letters in the source word: the current one and the immediately preceding one. We experimented with several other variations of context sets, but they either performed poorly on the development set, or required inordinate amounts of memory.</Paragraph> <Paragraph position="11"> LEN The length model learns the probability distribution of the number of edit operations to be performed, which is then incorporated into the similarity score. This model represents an attempt to counterbalance the effect of longer words being assigned lower probabilities.</Paragraph> <Paragraph position="12"> The models were implemented with the GMTK toolkit (Bilmes and Zweig, 2002). A more detailed description of the models can be found in (Filali and Bilmes, 2005).</Paragraph> </Section> <Section position="6" start_page="46" end_page="49" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 5.1 Setup </SectionTitle> <Paragraph position="0"> We evaluated various methods for computing word similarity on the task of the identification of cognates.
The input consists of pairs of words that have the same meaning in distinct languages. For each pair, the system produces a score representing the likelihood that the words are cognate. Ideally, the scores for true cognate pairs should always be higher than scores assigned to unrelated pairs.</Paragraph> <Paragraph position="1"> For binary classification, a specific score threshold could be applied, but we defer the decision on the precision-recall trade-off to downstream applications. Instead, we order the candidate pairs by their scores, and evaluate the ranking using 11-point interpolated average precision (Manning and Schutze, 2001). Scores are normalized by the length of the longer word in the pair.</Paragraph> <Paragraph position="2"> Word similarity is not always a perfect indicator of cognation because it can also result from lexical borrowing and random chance. It is also possible that two words are cognates and yet exhibit little surface similarity. Therefore, the upper bound for average precision is likely to be substantially lower than 100%.</Paragraph> </Section> <Section position="2" start_page="47" end_page="47" type="sub_section"> <SectionTitle> 5.2 Data </SectionTitle> <Paragraph position="0"> The training data for our cognate identification experiments comes from the Comparative Indoeuropean Data Corpus (Dyen et al., 1992). The data contains word lists of 200 basic meanings representing 95 speech varieties from the Indoeuropean family of languages. Each word is represented in an orthographic form without diacritics using the 26 letters of the Roman alphabet. Approximately 180,000 cognate pairs were extracted from the corpus.</Paragraph> <Paragraph position="1"> The development set was composed of three language pairs: Italian-Croatian, Spanish-Romanian, and Polish-Russian. We chose these three language pairs because they represent very different levels of relatedness: 25.3%, 58.5%, and 73.5% of the word pairs are cognates, respectively.
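The evaluation measure used above follows the standard definition of 11-point interpolated average precision: interpolated precision at recall level r is the maximum precision attained at any recall of at least r, averaged over r = 0.0, 0.1, ..., 1.0. A minimal sketch (function name hypothetical, assuming at least one positive pair):

```python
def eleven_point_ap(scores, labels):
    """Rank pairs by score (descending) and average the interpolated
    precision at recall levels 0.0, 0.1, ..., 1.0, where interpolated
    precision at r is the maximum precision at any recall >= r."""
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda x: -x[0])]
    total_pos = sum(ranked)
    prec, rec, tp = [], [], 0
    for k, lab in enumerate(ranked, start=1):
        tp += lab
        prec.append(tp / k)   # precision after rank k
        rec.append(tp / total_pos)  # recall after rank k
    levels = [i / 10 for i in range(11)]
    interp = []
    for r in levels:
        candidates = [p for p, c in zip(prec, rec) if c >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / len(levels)
```

A perfect ranking, with every cognate pair scored above every unrelated pair, yields an average precision of 1.0, which is why the cognation percentage of a list serves as a chance baseline.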
The percentage of cognates within the data is important, as it provides a simple baseline against which to compare the success of our algorithms. If our cognate identification process were random, we would expect to get roughly these percentages for our recognition precision (on average).</Paragraph> <Paragraph position="2"> The test set consisted of five 200-word lists representing English, German, French, Latin, and Albanian, compiled by Kessler (2001). The lists for these languages were removed from the training data (except Latin, which was not part of the training set), in order to keep the testing and training data as separate as possible. For the supervised experiments, we converted the test data to have the same orthographic representation as the training data.</Paragraph> <Paragraph position="3"> The training process for the DBN models consisted of three iterations of Expectation Maximization, which was determined to be optimal on the development data. Each pair was used twice, once in each source-target direction, to enforce the symmetry of the scoring. One of the models, the context-dependent model, remained asymmetrical despite the two-way training. In order to remove the undesirable asymmetry, we averaged the scores in both directions for each word pair.</Paragraph> </Section> <Section position="3" start_page="47" end_page="48" type="sub_section"> <SectionTitle> 5.3 Results </SectionTitle> <Paragraph position="0"> Table 1 shows the average cognate identification precision on the test set for a number of methods. EDIT is a baseline edit distance with uniform costs. MIEL refers to edit distance with weights computed using the approach outlined in (Mielke, 2005). ALINE denotes the algorithm for aligning phonetic sequences (Kondrak, 2000) described in Section 2.1. R&Y is the stochastic transducer of Ristad and Yianilos (1998). LLW stands for Levenshtein with learned weights, which is a modification of R&Y proposed by Mann and Yarowsky (2001).
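The symmetrization applied to the context-dependent model amounts to averaging the two directional scores, which guarantees a direction-independent result for any pairwise scorer. A trivial sketch (names hypothetical):

```python
def symmetric_score(score, w1, w2):
    """Average the scores of both source-target directions so that an
    asymmetric scorer yields the same value for (w1, w2) and (w2, w1)."""
    return 0.5 * (score(w1, w2) + score(w2, w1))
```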
The PHMM column provides the results reported in (Mackay and Kondrak, 2005) for the best Pair HMM model, which uses log odds scoring. Finally, DBN stands for our best results obtained with a DBN model, in this case the averaged context model.</Paragraph> <Paragraph position="1"> Table 2 shows the aggregate results for various DBN models. Two different results are given for each model: the raw score, and the score normalized by the length of the longer word. The models are the memoriless context-independent model (MCI), the memory model (MEM), the length model (LEN), and the context model (CON). The context model results are split as follows: results in the original direction (FOR), results with all word pairs reversed (REV), and the results of averaging the scores for each word pair in the forward and reverse directions (AVE).</Paragraph> <Paragraph position="2"> Table 3 shows the aggregate results for the unsupervised approaches. In the unsupervised tests, the training set was not used, as the models were trained directly on the testing data without access to the cognation information. For the unsupervised tests, the test set was kept in its original phonetic form. The table compares the results obtained with various DBN models and with the CORDI algorithm described in Section 3.3.</Paragraph> </Section> <Section position="4" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 5.4 Discussion </SectionTitle> <Paragraph position="0"> The results in Table 1 strongly suggest that the learning approaches are more effective than the manually-designed schemes for cognate identification. However, it has to be remembered that the learning process was conducted on a relatively large set of Indoeuropean cognates. Even though there was no overlap between the training and the test set, the latter also contained cognate pairs from the same language family.
For each of the removed languages, there are other closely related languages that are retained in the training set, which may exhibit similar or even identical regular correspondences.</Paragraph> <Paragraph position="1"> The manually-designed schemes have the advantage of not requiring any training sets after they have been developed. Nevertheless, Mielke's metric appears to produce only a small improvement over the baseline EDIT metric, which is not surprising considering that ALINE was developed specifically for identifying cognates, and Mielke's substitution matrix lacks several phonemes that occur in the test set.</Paragraph> <Paragraph position="2"> Among the DBN models, the averaged context model performs the best. The averaged context model is clearly better than either of the unidirectional models on which it is based. It is likely that the averaging allows the scoring to take contextual information from both words into account, instead of just one or the other. The averaged context DBN model performs, on average, about as well as the Pair HMM approach, but substantially better than the R&Y approach and its modification, LLW.</Paragraph> <Paragraph position="3"> In the unsupervised context, all DBN models fail to perform meaningfully, regardless of whether the scores are normalized or not. In view of this, it is remarkable that CORDI achieves a respectable performance just by utilizing discovered correspondences, having no knowledge of phonetics or of the identity of phonemes. The precision of CORDI is at the same level as the phonetically-based ALINE. In fact, a method that combines ALINE and CORDI achieves an average precision of 0.681 on the same test set (Kondrak, in preparation).</Paragraph> <Paragraph position="4"> In comparison with the results of Filali and Bilmes (2005), certain differences are apparent.
The memory and length models, which performed better than the memoriless context-independent model on the pronunciation task, perform worse overall here.</Paragraph> <Paragraph position="5"> This is especially notable in the case of the length model, which was the best overall performer on their task. The context-dependent model, however, performed well on both tasks.</Paragraph> <Paragraph position="6"> As mentioned in (Mann and Yarowsky, 2001), it appears that there are significant differences between the pronunciation task and the cognate identification task. They offer some hypotheses as to why this may be the case, such as noise in the data and the size of the training sets, but these issues are not apparent in the task presented here. The training set was quite large and consisted only of known cognates. The two tasks are inherently different, in that scoring in the pronunciation task involves finding the best match of a surface pronunciation with pronunciations in a lexicon, while the cognate task involves the ordering of scores relative to each other. Certain issues, such as the length of words, may become more prominent in this setup. We countered this by normalizing all scores, which was not done in (Filali and Bilmes, 2005). As can be seen in Table 2, the normalization by length appears to improve the results on average. It is notable that normalization even helps the length model on this task, despite the fact that it was designed to take word length into account.</Paragraph> </Section> </Section> </Paper>