File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-1101_intro.xml
Size: 5,859 bytes
Last Modified: 2025-10-06 14:03:53
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1101"> <Title>Linguistic Distances</Title> <Section position="4" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 Pronunciation </SectionTitle>
<Paragraph position="0"> John Laver, the author of the most widely used textbook in phonetics, claimed that &quot;one of the most basic concepts in phonetics, and one of the least discussed, is that of phonetic similarity [boldface in original, JN & EH]&quot; (Laver, 1994, p. 391), justifying the attention the workshop pays to it. Laver goes on to sketch the work that has been done on phonetic similarity, or, more exactly, phonetic distance, in particular the empirical derivation of confusion matrices, which indicate the likelihood with which people or speech recognition systems confuse one sound for another. Miller & Nicely (1955) founded this approach with studies of how humans confused some sounds more readily than others. Although &quot;confusability&quot; is a reasonable reflection of phonetic similarity, it is perhaps worth noting that confusion matrices are often asymmetric, suggesting that something more complex is at play. Clark & Yallop (1995, p. 319ff) discuss this line of work further, suggesting more sophisticated analyses which aggregate confusion matrices based on segments.</Paragraph>
<Paragraph position="1"> In addition to the phonetic interest (above), phonologists have likewise shown interest in the question of similarity, especially in recent work. Albright and Hayes (2003) have proposed a model of phonological learning which relies on &quot;minimal generalization&quot;. The idea is that children learn, e.g., rules of allomorphy on the basis not merely of rules and individual lexical exceptions (the earlier standard wisdom), but rather on the basis of slight but reliable generalizations. An example is the formation of the past tense of verbs ending in [ɪŋ], 'ing' (fling, sing, sting, spring, string), which build past tenses in 'ung' [ʌŋ]. We omit details but note that the &quot;minimal generalization&quot; is minimally DISTANT in pronunciation.</Paragraph>
<Paragraph position="2"> Frisch, Pierrehumbert & Broe (2004) have also kindled an interest in segmental similarity among phonologists with their claim that syllables in Semitic languages are constrained to have unlike consonants in syllable onset and coda. Their work has not gone unchallenged (Bailey and Hahn, 2005; Hahn and Bailey, 2005), but it has certainly created further theoretical interest in phonological similarity.</Paragraph>
<Paragraph position="3"> There has been a great deal of attention in psycholinguistics to the problem of word recognition, and several models appeal explicitly to the &quot;degree of phonetic similarity among the words&quot; (Luce and Pisoni, 1998, p. 1), but most of these models employ relatively simple notions of sequence similarity, e.g., the idea that distance may be operationalized as the number of replacements needed to derive one word from another, ignoring the problem of similarity among words of different lengths (Vitevitch and Luce, 1999). Perhaps more sophisticated computational models of pronunciation distance could play a role in these models in the future.</Paragraph>
<Paragraph position="4"> Kessler (1995) showed how to employ edit distance to operationalize pronunciation difference in order to investigate dialectology more precisely, an idea which Heeringa (2004), in particular, pursued at great length. Kondrak (2002) created a variant of the dynamic programming algorithm used to compute edit distance, which he used to identify cognates in historical linguistics. McMahon & McMahon (2005) include investigations of pronunciation similarity in their recent book on phylogenetic techniques in historical linguistics. Several of the contributions to this volume build on these earlier efforts or are relevant to them.</Paragraph>
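To make the family of techniques just cited concrete, the following minimal Python sketch implements the standard dynamic-programming (Wagner-Fischer) edit distance that this line of work builds on. It is an illustration only, not Kessler's, Heeringa's, or Kondrak's actual system: it assumes unit costs for insertion, deletion, and substitution, whereas the cited work uses linguistically weighted operations.

def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two segment sequences."""
    m, n = len(a), len(b)
    # d[i][j] = cost of transforming a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + (0 if same else 1)) # substitution
    return d[m][n]

# E.g., two pronunciations differing in a single vowel:
# edit_distance("milk", "melk") == 1

Dialectometric applications replace the unit costs with segment-sensitive weights (e.g., cheaper vowel-vowel substitutions) and sometimes normalize by alignment length; these are exactly the refinements that Heeringa et al. weigh below.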
<Paragraph position="5"> Kondrak and Sherif (this volume) continue the investigation into techniques for identifying cognates, now comparing several techniques which rely solely on parameters set by the researcher to machine learning techniques which automatically optimize those parameters. They show the machine learning techniques to be superior, in particular techniques based on hidden Markov models and dynamic Bayesian nets.</Paragraph>
<Paragraph position="6"> Heeringa et al. (this volume) investigate several extensions of the fundamental edit distance algorithm for use in dialectology, including sensitivity to order and context as well as syllabicity constraints, which they argue to be preferable, and length normalization and graded weighting schemes, which they argue against.</Paragraph>
<Paragraph position="7"> Dinu & Dinu (this volume) investigate string metrics which attach more importance to the initial parts of the string. They embed this insight into a scheme in which n-grams are ranked (sorted) by frequency, and the difference in the rankings is used to assay language differences (see the first sketch following this section).</Paragraph>
<Paragraph position="8"> Their paper proves that the difference in rankings is a proper mathematical metric.</Paragraph>
<Paragraph position="9"> Singh (this volume) investigates the technical question of identifying languages and character encoding systems from limited amounts of text.</Paragraph>
<Paragraph position="10"> He collects roughly the 1,000 most frequent n-grams of various sizes and then classifies new texts based on the similarity between the frequency distributions of the known texts and those of the texts to be classified. His empirical results show &quot;mutual cross entropy&quot; to identify similarity most reliably, but there are several close competitors (see the second sketch following this section).</Paragraph> </Section> </Paper>
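As an illustration of the rank-based comparison Dinu & Dinu describe, the Python sketch below ranks character n-grams by descending frequency and sums absolute rank differences. The penalty rank assigned to n-grams absent from one of the two rankings, and the arbitrary tie-breaking among equally frequent n-grams, are simplifying assumptions of this sketch, not details taken from their paper.

from collections import Counter

def frequency_ranks(text, n=2):
    """Rank character n-grams by descending frequency (most frequent = rank 1)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ordered = sorted(counts, key=counts.get, reverse=True)  # ties broken arbitrarily
    return {gram: rank for rank, gram in enumerate(ordered, start=1)}

def rank_distance(text_a, text_b, n=2):
    """Sum of absolute rank differences over the union of observed n-grams."""
    ra, rb = frequency_ranks(text_a, n), frequency_ranks(text_b, n)
    penalty = max(len(ra), len(rb)) + 1  # assumed rank for n-grams unseen in one text
    return sum(abs(ra.get(g, penalty) - rb.get(g, penalty))
               for g in set(ra) | set(rb))

Because ranks rather than raw frequencies are compared, differences among the most frequent n-grams, which tend to come from the beginnings of high-frequency words, dominate the score, which is the intuition the authors formalize.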
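Singh's frequency-profile comparison can be sketched similarly. Here &quot;mutual cross entropy&quot; is read as the sum of the two directed cross entropies, one plausible interpretation rather than his exact formulation; the n-gram order and the smoothing floor for unseen n-grams are likewise assumptions of this illustration, while the profile size of 1,000 follows his description.

import math
from collections import Counter

def ngram_profile(text, n=3, top=1000):
    """Relative frequencies of the `top` most frequent character n-grams."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    kept = dict(counts.most_common(top))
    total = sum(kept.values())
    return {g: c / total for g, c in kept.items()}

def cross_entropy(p, q, floor=1e-9):
    """H(P, Q) = -sum_g p(g) * log q(g), with a small floor for unseen n-grams."""
    return -sum(p_g * math.log(q.get(g, floor)) for g, p_g in p.items())

def mutual_cross_entropy(p, q):
    """Symmetrized score: smaller values mean more similar distributions."""
    return cross_entropy(p, q) + cross_entropy(q, p)

# A test text is assigned to the known language/encoding whose stored profile
# minimizes mutual_cross_entropy with the profile of the test text.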