<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1108"> <Title>Evaluation of String Distance Algorithms for Dialectology</Title>
<Section position="4" start_page="52" end_page="55" type="metho"> <SectionTitle> 2 String Comparison Algorithms </SectionTitle>
<Paragraph position="0"> In this section we describe a number of string comparison algorithms, largely following Inkpen et al. (2005). The methods can be classified according to different factors: representation (unigram, bigram, trigram, xbigram), comparison of n-grams (binary or gradual), status of order (with or without alignment), and type of alignment (free or forced alignment with respect to the vowel/consonant distinction). We illustrate the methods with examples in which we compare German and Dutch dialect pronunciations of the word milk. (Our transcriptions omit diacritics for simplicity's sake.)</Paragraph>
<Section position="1" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 2.1 Contextual sensitivity </SectionTitle>
<Paragraph position="0"> In the German dialect of Reelkirchen milk is pronounced as [mElk@]. The bigram notation is [-m mE El lk k@ @-] and the trigram notation is [--m -mE mEl Elk lk@ k@- @--]. The same word is pronounced as [mEl@ç] in the German dialect of Tann.</Paragraph>
<Paragraph position="1"> The bigram and trigram representations are [-m mE El l@ @ç ç-] and [--m -mE mEl El@ l@ç @ç- ç--] respectively.</Paragraph>
<Paragraph position="2"> In the simplest method we present in this paper, the distance is found by calculating 1 minus twice the number of shared segment n-grams divided by the total number of n-grams in both words. Inkpen et al. mention a bigram-based, a trigram-based and an xbigram-based procedure, which they call DICE, TRIGRAM and XDICE respectively. We also consider a unigram-based procedure which we call UNIGRAM. The two pronunciations share four unigrams: [m, E, l] and [@]. There are 5 + 5 = 10 unigram tokens in total in the two words, so the unigram similarity is (2 x 4)/10 = 0.8, and the distance 1 - 0.8 = 0.2. The two pronunciations share three bigrams: [-m, mE] and [El]. There are 6 + 6 = 12 bigram tokens in the two strings, so bigram similarity is (2 x 3)/12 = 0.5, and the distance 1 - 0.5 = 0.5. Finally, the two pronunciations have three trigrams in common: [--m, -mE] and [mEl] among 7 + 7 = 14 in total, yielding a trigram similarity of (2 x 3)/14 = 0.4 and distance 1 - 0.4 = 0.6.</Paragraph>
<Paragraph position="3"> Our interest in this issue is linguistic: longer n-grams allow comparison on the basis of phonic context, and unigram comparisons have correctly been criticized for ignoring this (Kessler, 2005).</Paragraph> </Section>
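The n-gram measures of Section 2.1 can be sketched in a few lines of code. The following Python fragment is an illustration only, not the implementation used in the paper; plain ASCII letters (e.g. 'c' for the palatal fricative of the Tann form) stand in for the phonetic symbols.

```python
from collections import Counter

def ngrams(word, n):
    """Pad a word with n-1 boundary symbols on each side and list its n-grams."""
    padded = "-" * (n - 1) + word + "-" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_distance(w1, w2, n):
    """Dice-style distance: 1 minus twice the shared n-grams over all n-gram tokens."""
    a, b = Counter(ngrams(w1, n)), Counter(ngrams(w2, n))
    shared = sum((a & b).values())
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))

# Reelkirchen [mElk@] vs. Tann [mEl@c]
for n, name in [(1, "UNIGRAM"), (2, "DICE"), (3, "TRIGRAM")]:
    print(name, round(ngram_distance("mElk@", "mEl@c", n), 2))
# UNIGRAM 0.2, DICE 0.5, TRIGRAM 0.57
```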
<Section position="2" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 2.2 Order of segments </SectionTitle>
<Paragraph position="0"> When comparing the German dialect pronunciation of Reelkirchen [mElk@] with the Dutch dialect pronunciation of Haarlem [mEl@k], the unigram procedure presented above will detect no difference. One might argue that we are dealing with a swap, but this is effectively an appeal to order.</Paragraph>
<Paragraph position="1"> The example is not convincing for n-gram measures with n ≥ 2, but we prefer to separate issues of order from issues of context sensitivity.</Paragraph>
<Paragraph position="2"> We use edit distance (also known as Levenshtein distance) for this purpose, and we assume familiarity with it (Kruskal, 1999). In our use of edit distance all operations have a cost of 1.</Paragraph> </Section>
<Section position="3" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 2.3 Normalization by length </SectionTitle>
<Paragraph position="0"> When the edit distance is divided by the length of the longer string, Inkpen et al. call it normalized edit distance (NED). In our approach we divide "raw edit distance" by alignment length. The same minimum distance found by the edit distance algorithm may be obtained on the basis of several alignments, which may have different lengths.</Paragraph>
<Paragraph position="1"> We found that the longest alignment has the greatest number of matches. Therefore we normalize by dividing the edit distance by the length of the longest alignment.</Paragraph>
<Paragraph position="2"> We have normally employed a length normalization in earlier work (Heeringa, 2004), reasoning that words are such fundamental linguistic units that dialect perception was likely to be word-based. We shall test this premise in this paper.</Paragraph>
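This normalization can be illustrated as follows. The sketch assumes unit costs for all operations and normalizes by the length of the longest minimal-cost alignment; it omits the linguistic alignment constraints introduced in Section 2.5 and is not the implementation used for the experiments.

```python
def normalized_edit_distance(s, t):
    """Unit-cost edit distance divided by the length of the longest minimal-cost alignment."""
    n, m = len(s), len(t)
    # table[i][j] = (minimal cost, longest alignment length among minimal-cost alignments)
    table = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, i)
    for j in range(1, m + 1):
        table[0][j] = (j, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            candidates = [
                (table[i - 1][j - 1][0] + sub, table[i - 1][j - 1][1] + 1),  # (mis)match
                (table[i - 1][j][0] + 1, table[i - 1][j][1] + 1),            # deletion
                (table[i][j - 1][0] + 1, table[i][j - 1][1] + 1),            # insertion
            ]
            best_cost = min(c for c, _ in candidates)
            best_len = max(l for c, l in candidates if c == best_cost)
            table[i][j] = (best_cost, best_len)
    cost, length = table[n][m]
    return cost / length

# Grouw [mOlk@] vs. Haarlem [mEl@k]: 3 operations over a 6-slot alignment
print(normalized_edit_distance("mOlk@", "mEl@k"))  # 0.5
```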
<Paragraph position="3"> Marzal & Vidal (1993) show that the normalized edit distance between two strings cannot be obtained via "post-normalization", i.e., by first computing the (unnormalized) edit distance and then normalizing this by the length of the corresponding editing path. Unnormalized edit distance satisfies the triangle inequality, which is axiomatic for distances, but the quantities obtained via post-normalization need not satisfy this axiom. Marzal & Vidal provide an alternative procedure which is guaranteed to produce genuine distances, satisfying all of the relevant axioms. In their modified algorithm, one computes one minimum weight for each of the possible lengths of editing paths at each point in the computational lattice. Once all these weights are calculated, they are divided by their corresponding path lengths, and the minimum quotient represents the normalized edit distance.</Paragraph>
<Paragraph position="4"> The basic idea behind edit distance is to find the minimum cost of changing one string into another.</Paragraph>
<Paragraph position="5"> Length normalization represents a deviation from this basic idea. If a higher cost corresponds with a longer path length such that the quotient of the edit cost divided by the path length is minimal, then Marzal & Vidal's procedure opts for the minimal normalized distance, while post-normalization seeks what one might call "the normalized minimal distance" (see Marzal & Vidal's example 3.1 and Figure 2, p. 928).</Paragraph>
<Paragraph position="6"> Marzal & Vidal's examples of normalized minimal distances which are not also minimal normalized distances all involve operation costs we normally do not employ. In particular they allow indels (insertions and deletions) to be associated with much lower costs than substitutions, so that the longer paths associated with derivations involving indels are more than compensated for by the length normalization. Our costs are never structured in this way, so we conjecture that our post-normalizations do not genuinely run the risk of violating the distance axioms. We use 0 for the cost of mapping a symbol to itself, 1 to map it to a different symbol, including the empty symbol (covering the costs of indels), and infinity for non-allowed mappings. We maintain therefore that (unnormalized) costs higher than the minimum will never correspond to longer alignment lengths. If this is so, then the minimal edit cost divided by alignment length will also be the minimal normalized cost. If the unnormalized edit distance is minimal, we claim that the post-normalized edit distance must therefore be minimal as well.</Paragraph>
<Paragraph position="7"> We inspect an example to illustrate these issues.</Paragraph>
<Paragraph position="8"> We compare the Frisian (Grouw) pronunciation, [mOlk@], with the Haarlem pronunciation [mEl@k]. The Levenshtein algorithm may align the pronunciations as follows (infinity is assigned to the replacement of a vowel by a consonant in order to avoid alignments which violate syllabic structure):

1 2 3 4 5 6
m O l   k @
m E l @ k
  1   1   1
</Paragraph>
<Paragraph position="9"> The one pronunciation is transformed into the other by substituting [E] for [O], by deleting [@] after [l], and by inserting [@] after [k]. Since each operation has a cost of 1, and the alignment is 6 elements long, the normalized distance is (1 + 1 + 1)/6 = 0.5. The Levenshtein algorithm will also find an alignment in which the [@]'s are matched, while the [k]'s are inserted and deleted. That alignment gives the same (normalized) distance. Levenshtein distance will not find an alignment any longer than the one shown here, since longer alignments will not yield the minimum cost. This also holds for the examples shown below.</Paragraph> </Section>
<Section position="4" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 2.4 n-gram weights </SectionTitle>
<Paragraph position="0"> In the German dialect of Frohnhausen milk is pronounced as [mIlj@], and in the German of Grosswechsungen as [mElIç]. If we compare these using the techniques of Section 2.2, using bigrams, we obtain the following:

1  2  3  4  5  6
-m mI Il lj j@ @-
-m mE El lI Iç ç-
   1  1  1  1  1
</Paragraph>
<Paragraph position="1"> Since n-grams are compared in a binary way, the normalized distance is equal to (1 + 1 + 1 + 1 + 1)/6 = 0.83. But [mI] and [mE] (second position) are clearly more similar to each other than [j@] and [Iç] (fifth position). Inkpen et al. suggest weighting n-gram differences using segment overlap. They provide a formula for measuring gradual similarity of n-grams to be used in BI-DIST and TRI-DIST. Since we measure distances rather than similarity, we calculate n-gram distance as the proportion of segment positions at which the two n-grams differ:

dist(x1...xn, y1...yn) = (1/n) Σ_i d(xi, yi), where d(xi, yi) = 0 if xi = yi and 1 otherwise.
</Paragraph> </Section>
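A small illustration of this gradual weighting, assuming the per-position definition given above. Because the two example words have the same number of bigrams, the comparison can proceed position by position and no alignment step is shown here.

```python
def ngrams(word, n):
    padded = "-" * (n - 1) + word + "-" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def gradual_cost(x, y):
    """Distance between two equal-length n-grams: the share of positions that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Frohnhausen [mIlj@] vs. Grosswechsungen [mElIc], compared bigram by bigram
# ('c' stands in for the palatal fricative)
pairs = list(zip(ngrams("mIlj@", 2), ngrams("mElIc", 2)))
binary = sum(x != y for x, y in pairs) / len(pairs)
gradual = sum(gradual_cost(x, y) for x, y in pairs) / len(pairs)
print(round(binary, 2), round(gradual, 2))  # 0.83 0.5
```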
<Section position="5" start_page="54" end_page="54" type="sub_section"> <SectionTitle> 2.5 Linguistic Alignment </SectionTitle>
<Paragraph position="0"> When comparing the Frisian (Grouw) dialect pronunciation, [mOlk@], with that of German Grosswechsungen, [mElIç], using unigrams, we obtain:

1 2 3 4 5
m O l k @
m E l I ç
  1   1 1
</Paragraph>
<Paragraph position="1"> The normalized distance is then (1 + 1 + 1)/5 = 0.6. But this is linguistically an implausible alignment: syllables do not align when e.g. [k] aligns with [I], etc. We may remedy this by requiring the Levenshtein algorithm to respect the distinction between vowels and consonants, allowing only three exceptions: semivowels [j, w] may match vowels (or consonants), the maximally high vowels [i, u] may match consonants (or vowels), and [@] may match sonorant consonants (nasals and liquids) in addition to vowels. Disallowed matches are weighted so heavily (via the cost of the substitution operation) that the algorithm will always use alternative alignments, effectively preferring insertions and deletions (indels) instead. Applying these restrictions, we obtain the following, with normalized distance (1 + 1 + 1 + 1)/6 = 0.67:

1 2 3 4 5 6
m O l   k @
m E l I ç
  1   1 1 1
</Paragraph>
<Paragraph position="2"> In comparisons based on bigrams, we allow two bigrams to match when at least one segment pair matches (the first, the second, or both). Two trigrams match when at least the middle pair matches. Comparing the same pronunciations as above using bigrams without linguistic conditions, we obtain the following alignment:

1    2    3    4    5    6
-m   mO   Ol   lk   k@   @-
-m   mE   El   lI   Iç   ç-
     1    1    1    1    1
     0.5  0.5  0.5  1    0.5

The normalized distance is (1 + 1 + 1 + 1 + 1)/6 = 0.83 using binary bigram weights (costs), and (0.5 + 0.5 + 0.5 + 1 + 0.5)/6 = 0.5 using gradual weights. But the above alignment does not respect the vowel/consonant distinction at the fifth position, where neither [k] vs. [I] nor [@] vs. [ç] is allowed. We correct this at once:

1    2    3    4    5    6    7
-m   mO   Ol   lk        k@   @-
-m   mE   El   lI   Iç   ç-
</Paragraph>
<Paragraph position="3"> The calculation based on gradual weights is a bit more complex. Two bigrams may match even when a non-allowed pair occurs in one of the two positions, e.g., [k] vs. [I] at the fourth position in the alignment immediately above. The cost of this match should be higher (via weights) than that of an allowed pair with different elements, e.g., the pair [O] versus [E] at the second or third position, but not so high that the match cannot occur. We settle on the following scheme. Two n-grams [x1...xn] and [y1...yn] can only match if at least one pair (xi, yi) matches linguistically. We weight linguistically mismatching pairs (xj, yj) twice as high as matching (but non-identical) pairs. Since an allowed match contains at least one matching pair and at most n - 1 mismatching pairs, we set the cost of the most expensive match of two n-grams to 1, and we assign the weight of 2/(2n - 1) to a mismatching pair, and 1/(2n - 1) to a matching (but nonidentical) one. Indels cost the same as the most costly (matching) n-grams, in this case 1.</Paragraph>
<Paragraph position="4"> In our bigram-based example, we obtain a weight of 2/(2 x 2 - 1) = 0.67 at position 4, since the pair [k] vs. [I] is a linguistic mismatch. At positions 2 and 3 we obtain weights of 1/(2 x 2 - 1) = 0.33 since [O] and [E] are (nonidentical) matches. Note that a segment (vowel or consonant) versus '-' (boundary) is processed as a mismatch. Therefore the weight at position 6 is equal to 0.33 ([k] vs. [ç]) + 0.67 ([@] versus [-]), summing to 1.</Paragraph> </Section>
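The constraint can be sketched by giving the Levenshtein algorithm a substitution cost function that returns an infinite cost for disallowed vowel/consonant pairings. The segment classes below are rough ASCII approximations introduced only for this illustration; they are not the paper's segment inventory or cost tables.

```python
import math

VOWELS = set("aeiouyAEIOUY@")   # crude segment classes for this sketch only
SEMI = set("jw")                # semivowels may match either class
HIGH = set("iu")                # maximally high vowels may match consonants
SONORANT = set("mnNlrR")        # sonorant consonants, which [@] may match

def substitution_cost(a, b):
    """0 for identity, 1 for an allowed pairing, infinity for a disallowed one."""
    if a == b:
        return 0
    if a in SEMI or b in SEMI or a in HIGH or b in HIGH:
        return 1
    if (a == "@" and b in SONORANT) or (b == "@" and a in SONORANT):
        return 1
    # otherwise the pairing must respect the vowel/consonant distinction
    return 1 if (a in VOWELS) == (b in VOWELS) else math.inf

def constrained_distance(s, t):
    """Unit-cost edit distance under the constraints, normalized by alignment length."""
    n, m = len(s), len(t)
    table = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, i)
    for j in range(1, m + 1):
        table[0][j] = (j, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [
                (table[i - 1][j - 1][0] + substitution_cost(s[i - 1], t[j - 1]),
                 table[i - 1][j - 1][1] + 1),
                (table[i - 1][j][0] + 1, table[i - 1][j][1] + 1),
                (table[i][j - 1][0] + 1, table[i][j - 1][1] + 1),
            ]
            cost = min(c for c, _ in cands)
            length = max(l for c, l in cands if c == cost)
            table[i][j] = (cost, length)
    cost, length = table[n][m]
    return cost / length

# Grouw [mOlk@] vs. Grosswechsungen [mElIc] ('c' stands in for the palatal fricative)
print(round(constrained_distance("mOlkc@".replace("c", ""), "mElIc"), 2)
      if False else round(constrained_distance("mOlk@", "mElIc"), 2))  # 0.67
```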
<Section position="6" start_page="54" end_page="55" type="sub_section"> <SectionTitle> 2.6 Similarity vs. distance </SectionTitle>
<Paragraph position="0"> Theoretically, similarity and distance should be each other's inverses. Thus in Section 2.1 we suggested that similarity should always be (1 - distance). This is not always straightforward when we normalize.</Paragraph>
<Paragraph position="1"> Inkpen et al. use both similarity and distance measures. Similarity measures are LCSR (Longest Common Subsequence Ratio), BI-SIM and TRI-SIM (LCSR generalized to bigrams and trigrams), and the corresponding distance measures are NED, BI-DIST and TRI-DIST. The measures are further distinguished in the way n-gram weights are compared: as binary weights in the similarity measures, and as gradual weights in the distance measures. When comparing the pronunciations of Frisian Hindelopen [mO@lk@] and German Grosswechsungen [mElIç], and respecting the linguistic alignment conditions (Section 2.5), we obtain:

1 2 3 4 5 6 7
m O @ l   k @
m E   l I ç
  1 1   1 1 1

The non-normalized similarity is equal to 2, and the non-normalized distance is equal to 5. Inkpen et al. normalize "by dividing the total edit cost by the length of the longer string", which is 6 in our example. Other possibilities are dividing by the length of the shorter string (5), the average length of the two strings (5.5) or the length of the alignment (7). Summarizing:

         shorter   longer    average   alignment
         string    string    string
sim.     0.4       0.33      0.36      0.29
dist.    1.0       0.83      0.91      0.71
total    1.4       1.17      1.27      1.00

Only the normalization via alignment length respects the wish that we regard similarity and distance as each other's inverses. (We have no proof that normalization by alignment length always allows this simple relation to similarity, but we have examined a large number of calculations in which this always seems to hold.) We can enforce this requirement in other approaches by first normalizing and then taking the inverse, but we take the result above to indicate that normalization via alignment length is the most natural procedure.</Paragraph> </Section> </Section>
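The table above can be reproduced directly from the counts given in the text (two exact matches, five operations, word lengths 6 and 5, alignment length 7); the following fragment is an illustration of the four normalizations only.

```python
def normalizations(matches, operations, len1, len2, alignment_len):
    """Similarity and distance under the four normalizations discussed in Section 2.6."""
    divisors = {
        "shorter string": min(len1, len2),
        "longer string": max(len1, len2),
        "average string": (len1 + len2) / 2,
        "alignment": alignment_len,
    }
    for name, d in divisors.items():
        sim, dist = matches / d, operations / d
        print(f"{name:15s} sim {sim:.2f}  dist {dist:.2f}  sum {sim + dist:.2f}")

# Hindelopen [mO@lk@] vs. Grosswechsungen [mElIc]: only the alignment-length
# normalization makes similarity and distance sum to exactly 1
normalizations(2, 5, 6, 5, 7)
```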
<Section position="5" start_page="55" end_page="55" type="metho"> <SectionTitle> 3 Data Sources </SectionTitle>
<Paragraph position="0"> The methods presented in Section 2 are applied to Norwegian and German dialect data described in this section. We emphasize that we measured distances only at the level of the segmental base, ignoring stress and tone marks, suprasegmentals and diacritics. We did in fact examine measurements which included the effects of segmental diacritics, which, however, resulted in decreased consistency and no apparent increase in quality.</Paragraph>
<Section position="1" start_page="55" end_page="55" type="sub_section"> <SectionTitle> 3.1 Norwegian </SectionTitle>
<Paragraph position="0"> The Norwegian data comes from a database comprising more than 50 dialect sites, compiled by Jørn Almberg and Kristian Skarbø of the Department of Linguistics of the University of Trondheim. The database includes recordings and transcriptions of the fable 'The North Wind and the Sun' in various Norwegian dialects. The Norwegian text consists of 58 different words, some of which occur more than once, in which case we seek a least expensive pairing of the different elements (Nerbonne and Kleiweg, 2003, p. 349).</Paragraph>
<Paragraph position="1"> On the basis of the recordings, Gooskens carried out a perception experiment which we describe in Section 4.1. The experiment is based on 15 dialects, the total number of dialects available at that time (spring 2000). Since we want to use the results of the experiment for validating our methods, we used the same set of 15 Norwegian dialects. It is important to note that Gooskens presented the recordings holistically, including differences in syntax, intonation and morphology. Our methods are restricted to words.</Paragraph> </Section>
<Section position="2" start_page="55" end_page="55" type="sub_section"> <SectionTitle> 3.2 German </SectionTitle>
<Paragraph position="0"> The German data comes from the Phonetischer Atlas Deutschlands and includes 186 dialect locations. For each location 201 words were recorded and transcribed. The data are available at the Forschungsinstitut für deutsche Sprache "Deutscher Sprachatlas" in Marburg. The material is from translations of Wenker-Sätze, taken from the famous survey conducted by Georg Wenker in 1879-1887 among teachers from approximately 40,000 locations in Germany. The transcriptions are made on the basis of recordings made under the direction of Joachim Göschel in the 1960's and 1970's in West Germany (Göschel 1992, pp. 64-70). After the German reunification similar surveys were conducted in former East Germany.</Paragraph>
<Paragraph position="1"> The data were transcribed by four transcribers, and each item was transcribed independently by at least two phoneticians who subsequently consulted to come to an agreement. In 2002 the data was digitized at the University of Groningen.</Paragraph> </Section> </Section>
<Section position="6" start_page="55" end_page="57" type="metho"> <SectionTitle> 4 Validation Methods </SectionTitle>
<Paragraph position="0"> When we apply a measurement technique to a specific problem we are interested both in the consistency of the measure and in its validity. The consistency of the measurement reflects the degree to which the independent elements in the sample tend to provide the same signal. Nunnally (1978, p. 211) recommends the generalized form of the Spearman-Brown formula for this purpose, which has come to be known as Cronbach's α. It is determined by the inter-item correlation, i.e. the average correlation coefficient for all of the pairs of items in the test, and the test size. Cronbach's α rises with the sample size, and it is therefore normally used to determine whether samples are large enough to provide reliable signals.</Paragraph>
<Paragraph position="1"> The validity of a measure, or more precisely, of the application of a measure to a particular problem, is a much more difficult and controversial issue (Nunnally, 1978, Chap. 3), but the basic question is whether the procedures in fact measure what they purport to measure, in our case the sort of pronunciation similarity which is important in distinguishing similar language varieties. In examining our measures for their validity in identifying the sort of pronunciation similarity which plays a role in dialectology we compare the measures to other indications we have that pronunciations are dialectally similar. We discuss these below in more detail. We consider the correlation with distances as perceived by the dialect speakers themselves (see Section 4.1) and the local (geographic) incoherence of dialect distances (see Section 4.2).</Paragraph>
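Cronbach's α in its standardized (Spearman-Brown) form can be computed from the mean inter-item correlation. The sketch below assumes that the 'items' are the per-word distances and the observations the pairs of sites; it is an illustration, not the consistency computation used in the paper.

```python
import numpy as np

def cronbach_alpha(items):
    """Standardized Cronbach's alpha (generalized Spearman-Brown formula),
    computed from the mean inter-item correlation of an (observations x items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    corr = np.corrcoef(items, rowvar=False)        # k x k inter-item correlations
    mean_r = (corr.sum() - k) / (k * (k - 1))      # average over off-diagonal cells
    return k * mean_r / (1 + (k - 1) * mean_r)

# toy data: three noisy copies of the same signal yield a high alpha
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
items = np.column_stack([signal + 0.4 * rng.normal(size=200) for _ in range(3)])
print(round(cronbach_alpha(items), 2))             # close to 1
```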
<Section position="1" start_page="56" end_page="56" type="sub_section"> <SectionTitle> 4.1 Perception </SectionTitle>
<Paragraph position="0"> The best opportunity for examining the quality of the measurements presents itself in the case of Norwegian, for which we were able to obtain the results of a perception experiment (Gooskens and Heeringa, 2004). For each of 15 varieties a recording of the fable 'The North Wind and the Sun' was presented to 15 groups of Norwegian high school pupils, one group from each of the 15 dialect sites represented in the material. All pupils were familiar with their own dialect and had lived most of their lives in the place in question (on average 16.7 years). Each group consisted of 16 to 27 listeners.</Paragraph>
<Paragraph position="1"> The mean age of the listeners was 17.8 years; 52 percent were female and 48 percent male.</Paragraph>
<Paragraph position="2"> The 15 dialects were presented in a randomized order, and each session was preceded by a (short) practice run. While listening to the dialects the listeners were asked to judge each of the 15 dialects on a scale from 1 (similar to native dialect) to 10 (not similar to native dialect). This means that each group of listeners judged the linguistic distances between their own dialect and the 15 dialects, including their own dialect. In this way we obtain a matrix of 15 x 15 perceived linguistic distances. This matrix is not completely symmetric. For example, the distance which the listeners from Bergen perceived between their own dialect and the dialect of Trondheim (8.55) is different from the distance as perceived by the listeners from Trondheim to Bergen (7.84).</Paragraph>
<Paragraph position="3"> In order to use this material to calibrate the different computational measurements, we examine the correlations of the 15 x 15 computational matrices with the 15 x 15 perceptual matrix. In calculating correlations we excluded the distances of dialects with respect to themselves, i.e. the distance of Bergen to Bergen, of Bjugn to Bjugn, etc. In the computational matrices these values are always zero; in the perceptual matrix they vary, but are normally greater than zero. This may be due to non-geographic (social or individual) variation, but since it distorts results in a non-random way (diagonal distances can only be too high, never too low), we exclude these values when calculating the correlation coefficient.</Paragraph>
<Paragraph position="4"> We calculated the standard Pearson product-moment correlation coefficient, but we interpret its significance cautiously, using the Mantel test (Bonnet and Van de Peer, 2002). In classical tests the assumption is made that the observations are independent, which observations in distance matrices emphatically are not. This is certainly true for calculations of geographic distances, which are minimally constrained to satisfy the standard distance axioms (non-negativity, symmetry, and the triangle inequality). We have argued above (Section 2.3) that the edit distances we employ are likewise genuine distances, which means that sums of edit distances are likewise constrained, and therefore should not be regarded as independent observations (in the sense needed for hypothesis testing). The Mantel test raises the standard of significance a good deal, so much that it will turn out that our small (15 x 15) matrices would need to differ by more than 0.1 in correlation coefficient in order to demonstrate significance. We will nonetheless urge that the results should be taken seriously, as the data needed is difficult to obtain and the indications are fairly clear (see below).</Paragraph> </Section>
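A sketch of the correlation between two distance matrices with the diagonal excluded, together with a simple Mantel permutation test. This is an illustration under common assumptions (joint permutation of the rows and columns of one matrix); it is not the statistical software used for the results reported here.

```python
import numpy as np

def off_diagonal(matrix):
    """Flatten a square distance matrix, dropping the self-distances on the diagonal."""
    m = np.asarray(matrix, dtype=float)
    return m[~np.eye(m.shape[0], dtype=bool)]

def mantel(d1, d2, permutations=10000, seed=1):
    """Pearson r between two distance matrices plus a one-sided Mantel permutation test."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    r = np.corrcoef(off_diagonal(d1), off_diagonal(d2))[0, 1]
    rng = np.random.default_rng(seed)
    n = d1.shape[0]
    exceed = 0
    for _ in range(permutations):
        p = rng.permutation(n)
        permuted = d2[p][:, p]                     # relabel the sites of one matrix
        r_perm = np.corrcoef(off_diagonal(d1), off_diagonal(permuted))[0, 1]
        exceed += r_perm >= r
    return r, (exceed + 1) / (permutations + 1)

# usage: r, p = mantel(computational_matrix, perceptual_matrix)
```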
<Section position="2" start_page="56" end_page="57" type="sub_section"> <SectionTitle> 4.2 Local Incoherence </SectionTitle>
<Paragraph position="0"> It is fundamental to dialectology that geographically closer varieties are, in general, linguistically more similar. Nerbonne and Kleiweg (2006) use this fact to select more probative measurements, namely those measurements which maximize the degree to which geographically close elements are likewise seen to be linguistically similar. Given our emphasis on distance it is slightly more convenient to formulate a measure of LOCAL INCOHERENCE and then to examine the degree to which various string distance measures minimize it. The basic idea is that we begin with each measurement site s, and inspect the n linguistically most similar sites in order of decreasing linguistic similarity to s. We then measure how far away these linguistically most similar sites are geographically, for example, in kilometers. Good measurements are those under which the linguistically most similar sites also turn out to be geographically close; poor measurements place them further away.</Paragraph>
<Paragraph position="1"> The details of the formulation reflect the results of dialectometry that dialect distances certainly increase with geographic distance, leveling off, however, so that geographically more remote variety pairs tend to have more nearly the same linguistic distances to each other. We sort variety pairs in order of decreasing linguistic similarity and weight more similar ones exponentially more than less similar ones. Given this disproportionate weighting of the most similar varieties, it also quickly becomes uninteresting to incorporate the effects of more than a small number of geographically closest varieties. We restrict our attention to the eight most similar linguistic varieties in calculating local incoherence.</Paragraph>
<Paragraph position="2">
d^L_{i,j}, d^G_{i,j} : geographic distance between sites i and j
d^L_{i,1...n-1} : geographic distances, sorted by increasing linguistic difference
d^G_{i,1...n-1} : geographic distances, sorted by increasing geographic distance
</Paragraph>
<Paragraph position="3"> Several remarks may be helpful in understanding the proposed measurement. First, all of the d_{i,j} concern geographic distances. The d^L_{i,1...n-1} (summed in D^L_i) range over the geographic distances, arranged, however, in increasing order of linguistic distance, while the d^G_{i,1...n-1} (summed in D^G_i) range over the geographic distances among the sites in the sample, arranged in increasing order of geographic distance. We examine the latter as an ideal case. If a given measurement technique always demonstrated that the neighbors of a given site used the most similar varieties, then D^L_i would be the same as D^G_i, and I_l would be 0. Second, we have argued above that it is appropriate to count the most similar varieties much more heavily in I_l, and this is reflected in the exponential decay in the weighting, i.e., 2^{-0.5j}, where j ranges over the increasingly less similar sites. Given this weighting of the most similar varieties, we are also justified in restricting the sum in D^L_i = Σ_{j=1}^{k}[...] to k = 8, and all of the results below use this limitation, which likewise improves efficiency.</Paragraph>
<Paragraph position="4"> We suppress further discussion of the calculation in the interest of saving space here, noting, however, that we used two different notions of geographic distance. When examining measurements of the German data, we measured geographic distance "as the crow flies", but since Norway is very mountainous, we used (19th century) travel distances for the Norwegian data (Gooskens).</Paragraph> </Section> </Section> </Paper>
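A sketch of the local incoherence computation described in Section 4.2. The sums D^L_i and D^G_i follow the text (eight sites, weights 2^{-0.5j}); how the two are combined into a single figure I_l is not fully recoverable from the text, so the mean relative difference used below is an assumption, chosen only so that I_l is 0 when the two coincide.

```python
import numpy as np

def local_incoherence(ling, geo, k=8):
    """Local incoherence I_l (sketch): for each site, compare the weighted geographic
    distances to its k linguistically most similar sites with those to its k
    geographically nearest sites, the latter being the ideal case."""
    ling, geo = np.asarray(ling, dtype=float), np.asarray(geo, dtype=float)
    n = ling.shape[0]
    weights = 2.0 ** (-0.5 * np.arange(1, k + 1))   # exponential decay 2^(-0.5 j)
    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        by_ling = sorted(others, key=lambda j: ling[i, j])[:k]
        by_geo = sorted(others, key=lambda j: geo[i, j])[:k]
        w = weights[: len(by_ling)]
        d_l = float(np.sum(w * geo[i, by_ling]))    # D^L_i
        d_g = float(np.sum(w * geo[i, by_geo]))     # D^G_i
        # assumed combination: relative excess of D^L over the ideal D^G (0 if equal)
        scores.append((d_l - d_g) / d_g)
    return float(np.mean(scores))
```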