File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-1107_intro.xml
Size: 11,282 bytes
Last Modified: 2025-10-06 14:03:52
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1107"> <Title>Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification</Title> <Section position="3" start_page="0" end_page="44" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The problem of measuring phonetic similarity between words arises in various contexts, including speech processing, spelling correction, commercial trademarks, dialectometry, and cross-language information retrieval (Kessler, 2005). A number of different schemes for computing word similarity have been proposed. Most of those methods are derived from the notion of edit distance. In its simplest form, edit distance is the minimum number of edit operations required to transform one word into the other. The set of edit operations typically includes substitutions, insertions, and deletions, and may incorporate more complex transformations.</Paragraph> <Paragraph position="1"> By assigning variable weights to various edit operations depending on the characters involved in the operations, one can design similarity schemes that are more sensitive to a given task. Such variable weight schemes can be divided into two main groups. One approach is to manually design edit operation weights on the basis of linguistic intuition and/or physical measurements. Another approach is to use machine learning techniques to derive the weights automatically from training data composed of a set of word pairs that are considered similar.</Paragraph> <Paragraph position="2"> The manually-designed schemes tend to be somewhat arbitrary, but can be readily applied to diverse tasks. The learning approaches are also easily adaptable to various tasks, but they crucially require training data sets of reasonable size. 
In general, the more complex the underlying model, the larger the data sets needed for parameter estimation.</Paragraph> <Paragraph position="3"> In this paper, we focus on a few representatives of both approaches, and compare their performance on the specific task of cognate identification. Cognate identification is the problem of finding, in distinct languages, words that can be traced back to a common word in a proto-language. Beyond historical linguistics, cognate identification has applications in other areas of computational linguistics (Mackay and Kondrak, 2005). Because the likelihood that two words across different languages are cognates is highly correlated with their phonetic similarity, cognate identification provides an objective test of the quality of phonetic similarity schemes.</Paragraph> <Paragraph position="4"> The remainder of this paper is organized as follows. Section 2 discusses the two manually designed schemes: the ALINE algorithm and a linguistically-motivated metric. Section 3 discusses various learning approaches. In Section 4, we describe Dynamic Bayesian Nets. Finally, in Section 5, we discuss the results of our experiments.</Paragraph> <Paragraph position="5"> 2 Two manually constructed schemes In this section, we first describe two manually constructed schemes and then compare their properties.</Paragraph> <Section position="1" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.1 ALINE </SectionTitle> <Paragraph position="0"> The ALINE algorithm (Kondrak, 2000) assigns a similarity score to pairs of phonetically-transcribed words on the basis of the decomposition of phonemes into elementary phonetic features. The algorithm was originally designed to identify and align cognates in vocabularies of related languages. 
Nevertheless, thanks to its grounding in universal phonetic principles, the algorithm can be used for estimating the similarity of any pair of words.</Paragraph> <Paragraph position="1"> The principal component of ALINE is a function that calculates the similarity of two phonemes that are expressed in terms of about a dozen multi-valued phonetic features (Place, Manner, Voice, etc.). The phonetic features are assigned salience weights that express their relative importance. Feature values are encoded as floating-point numbers in the range [0,1]. For example, the feature Manner can take any of the following seven values: stop = 1.0, affricate = 0.9, fricative = 0.8, approximant = 0.6, high vowel = 0.4, mid vowel = 0.2, and low vowel = 0.0. The numerical values reflect the distances between vocal organs during speech production.</Paragraph> <Paragraph position="2"> The overall similarity score is the sum of individual similarity scores between pairs of phonemes in an optimal alignment of two words, which is computed by a dynamic programming algorithm (Wagner and Fischer, 1974). A constant insertion/deletion penalty is applied for each unaligned phoneme.</Paragraph> <Paragraph position="3"> Another constant penalty is set to reduce the relative importance of vowel matches as opposed to consonant matches. The similarity value is normalized by the length of the longer word.</Paragraph> <Paragraph position="4"> ALINE's behavior is controlled by a number of parameters: the maximum phonemic score, the insertion/deletion penalty, the vowel penalty, and the feature salience weights. 
The parameters have default settings for the cognate matching task, but these settings can be optimized (tuned) on a development set that includes both positive and negative examples of similar words.</Paragraph> </Section> <Section position="2" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.2 A linguistically-motivated metric </SectionTitle> <Paragraph position="0"> Phonetically natural classes such as /p b m/ are much more common among the world's languages than unnatural classes such as /o z g/. In order to show that the bias towards phonetically natural patterns of phonological classes can be modeled without stipulating phonological features, Mielke (2005) developed a phonetic distance metric based on acoustic and articulatory measures. Mielke's metric encompasses 63 phonetic segments that are found in the inventories of multiple languages. Each phonetic segment is represented by a 7-dimensional vector that contains three acoustic dimensions and four articulatory dimensions (perceptual dimensions were left out because of the difficulties involved in comparing almost two thousand different sound pairs). The phonetic distance between any two phonetic segments was then computed as the Euclidean distance between the corresponding vectors.</Paragraph> <Paragraph position="1"> For determining the acoustic vectors, the recordings of 63 sounds were first transformed into waveform matrices. Next, distances between pairs of matrices were calculated using the Dynamic Time Warping technique. These acoustic distances were subsequently mapped to three acoustic dimensions using multidimensional scaling. The three dimensions can be interpreted roughly as (a) sonorous vs. sibilant, (b) grave vs. acute, and (c) low vs. high formant density.</Paragraph> <Paragraph position="2"> The articulatory dimensions were based on ultrasound images of the tongue and palate, video images of the face, and oral and nasal airflow measurements. 
The four articulatory dimensions were: oral constriction location, oral constriction size, lip constriction size, and nasal/oral airflow ratio.</Paragraph> </Section> <Section position="3" start_page="43" end_page="44" type="sub_section"> <SectionTitle> 2.3 Comparison </SectionTitle> <Paragraph position="0"> When ALINE was initially designed, there did not exist any concrete linguistically-motivated similarity scheme to which it could be compared. Therefore, it is interesting to perform such a comparison with the recently proposed metric.</Paragraph> <Paragraph position="1"> The principal difficulty in employing the metric for computing word similarity is the limited size of the phonetic segment set, which was dictated by practical considerations. The underlying database of phonological inventories representing 610 languages contains more than 900 distinct phonetic segments, of which almost half occur in only one language. However, because a number of complex measurements have to be performed for each sound, only 63 phonetic segments were analyzed, which is a set large enough to cover only about 20% of languages in the database. The set does not include such common phones as dental fricatives (which occur in English and Spanish), and front rounded vowels (which occur in French and German). It is not at all clear how one could derive pairwise distances involving sounds that are not in the set.</Paragraph> <Paragraph position="2"> In contrast, ALINE produces a similarity score for any two phonetic segments as long as they can be expressed using the program's set of phonetic features. The feature set can in turn be easily extended to include additional phonetic features required for expressing unusual sounds. In practice, any IPA symbol can be encoded as a vector of universal phonetic features.</Paragraph> <Paragraph position="3"> Another criticism that could be raised against Mielke's metric is that it has no obvious reference point. 
The choice of the particular suite of acoustic and articulatory measurements that underlie the metric is not explicitly justified. It is not obvious how one would decide between different metrics for modeling phonetic generalizations if more than one were available.</Paragraph> <Paragraph position="4"> On the other hand, ALINE was designed with a specific reference in mind, namely cognate identification. The goodness of alternative similarity schemes can be objectively measured on a test set containing both cognates and unrelated pairs from various languages.</Paragraph> <Paragraph position="5"> A perusal of individual distances in Mielke's metric reveals that some of them seem quite unintuitive. For example, [t] is closer to [j] than it is to [a0 ], [a1 ] is closer to [n] than to [i], [a2 ] is closer to [e] than to [g], etc. This may be caused either by the omission of perceptual features from the underlying set of features, or by the assignment of uniform weights to different features (Mielke, personal communication). It is difficult to objectively measure which phonetic similarity scheme produces more intuitive values. In order to approximate a human evaluation, we performed a comparison with the perceptual judgments of Laver (1994), who assigned numerical values to pairwise comparisons of 22 English consonantal phonemes on the basis of subjective auditory impressions. We counted the number of perceptual conflicts with respect to Laver's judgments for both Mielke's metric and ALINE's similarity values. For example, the triple ([a3 ], [j], [k]) is a conflict because [a3 ] is considered closer to [j] than to [k] in Mielke's matrix but the order is the opposite in Laver's matrix. The program identified 1246 conflicts with Mielke's metric, compared to 1058 conflicts with ALINE's scheme, out of 4620 triples. 
We conclude that in spite of the fact that ALINE is designed for identifying cognates, rather than directly for phonetic similarity, it is more in agreement with human perceptual judgments than Mielke's metric, which was explicitly designed for quantifying phonetic similarity.</Paragraph> </Section> </Section> </Paper>