<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1108">
  <Title>Evaluation of String Distance Algorithms for Dialectology</Title>
  <Section position="7" start_page="57" end_page="60" type="evalu">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In this section we present results based on the Norwegian data source in Sections 5.1 and 5.2 and on the German data source in Section 5.3.</Paragraph>
    <Paragraph position="1"> For each data source we consider 40 string comparison algorithms. We distinguish between methods with a binary comparison of n-grams and those with a gradual comparison of n-grams (see Section 2.4). Within the category of binary methods, we distinguish between three groups. In the first group, strings are compared just by counting the number of common n-grams, ignoring the order of elements (see Section 2.1). In the second group the n-grams are aligned (see Section 2.2).</Paragraph>
    <Paragraph position="2"> We call this 'free alignment'. In the third group we insist on the linguistically informed alignment of n-grams (see Section 2.5), dubbing this 'forced alignment'. Within the category of gradual methods, we distinguish between 'free alignment' (see Section 2.6) and 'forced alignment'. Finally, for each of these methods, we consider both an unnormalized version of the measure and one normalized by length (see Section 2.3).</Paragraph>
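    <Paragraph>As an illustration of the distinctions above, the following minimal sketch computes a binary, unigram edit distance with optional 'forced alignment' and optional length normalization. It is not the implementation evaluated here: the function name, the vowel inventory, the cost of 2 for a forbidden vowel/consonant match, and normalization by the length of the longer string are all assumptions made for the example.

```python
VOWELS = set("aeiouyæøå")  # hypothetical vowel inventory, for illustration only


def edit_distance(a, b, forced=False, normalize=False):
    """Binary unigram edit distance (insertion, deletion, substitution cost 1).

    forced=True forbids matching a vowel with a consonant by pricing such a
    substitution like a deletion plus an insertion (an assumption).
    normalize=True divides by the longer string's length (an assumption).
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            if x == y:
                sub = 0
            elif forced and ((x in VOWELS) != (y in VOWELS)):
                sub = 2  # vowel/consonant pair: never cheaper than delete + insert
            else:
                sub = 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + sub))
        prev = curr
    dist = prev[-1]
    if normalize:
        longest = max(len(a), len(b))
        return dist / longest if longest else 0.0
    return dist
```

With these assumptions, two hypothetical transcriptions such as "melk" and "milk" differ by one substitution, and the normalized variant divides that cost by four.</Paragraph>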
    <Paragraph position="3"> A measure can only be valid when it is consistent, but it may be consistent without being valid. Since consistency is a necessary condition for validity, we check the consistency of phonetic distance methods.</Paragraph>
    <Paragraph position="4"> Table 1: Correlations between perceptual distances and unnormalized string edit distance measurements among 15 Norwegian dialects. Columns: binary (no alignment, free alignment, forced alignment) and gradual (free alignment, forced alignment). Higher coefficients indicate better results.</Paragraph>
    <Paragraph position="5"> For each of the methods we calculated Cronbach's α values, which are based on the average inter-correlation among the words (Heeringa, 2004, pp. 170-173). A widely accepted threshold in social science for an acceptable α is 0.70 (Nunnally, 1978). After the consistency check, we discuss validation results.</Paragraph>
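    <Paragraph>A minimal sketch of such a consistency check follows. It assumes the standardized form of Cronbach's α, α = k·r̄ / (1 + (k − 1)·r̄), where k is the number of words and r̄ is the average pairwise correlation between words; whether this exact variant was used is an assumption, and the function names are hypothetical. Each word is represented by a list of distances, one per pair of dialects.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cronbach_alpha(word_distances):
    """Standardized Cronbach's alpha from the mean inter-word correlation.

    word_distances holds one list per word; each list gives that word's
    distance for every pair of dialects, in the same order.
    """
    k = len(word_distances)
    pairs = list(combinations(word_distances, 2))
    mean_r = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    return k * mean_r / (1 + (k - 1) * mean_r)
```

Under this formula α grows with the number of words as well as with their average inter-correlation.</Paragraph>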
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
5.1 Norwegian Perception
</SectionTitle>
      <Paragraph position="0"> In this section we first discuss results of unnormalized string edit distance measures, and compare them with their normalized counterparts later in the section.</Paragraph>
      <Paragraph position="1"> The Cronbach's α values of the unnormalized measurements vary from 0.84 to 0.87. The Cronbach's α values of the methods with 'forced alignment' are a bit lower than the α values of the other methods. An outlier arises when combining 'forced alignment' with gradual bigram distances (α = 0.78), but all of these values indicate that the measurements are quite consistent.</Paragraph>
      <Paragraph position="2"> We calculated correlations to the perceptual distances which are described in Section 4.1. Results are given in Table 1. Note that the effect size, i.e., the r value itself, is quite high, 0.66 &lt; r &lt; 0.73, meaning that the various distance measures account for 43.6-53.3% of the variance in the perception measurements. All of the correlation coefficients are highly significant (p &lt; 0.001), but given the stringency of the Mantel test, they do not differ significantly from one another.</Paragraph>
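      <Paragraph>For readers unfamiliar with the Mantel test, the following is a generic sketch of its permutation form, not the procedure actually run for Table 1. It correlates the upper triangles of two symmetric distance matrices and estimates significance by relabelling the sites of one matrix; the function names, the default of 999 permutations, the fixed seed, and the one-sided p-value are assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def upper_triangle(matrix, order):
    """Above-diagonal entries after relabelling the sites according to order."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(a)))
    flat_b = upper_triangle(b, identity)
    observed = pearson(upper_triangle(a, identity), flat_b)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of a together
        if pearson(upper_triangle(a, order), flat_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Squaring the returned r gives the share of variance accounted for: 0.66 squared is about 0.436 and 0.73 squared is about 0.533, which is where the 43.6-53.3% range comes from.</Paragraph>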
      <Paragraph position="3"> The correlations are quite similar. The maximal difference we found was 0.07, so that we conclude that none of the methods is strikingly better or worse in operationalizing the level of pronunciation difference that dialect speakers are sensitive to.</Paragraph>
      <Paragraph position="4"> Table 2: Correlations between perceptual distances and different normalized string edit distance measurements among 15 Norwegian dialects. Columns: binary (no alignment, free alignment, forced alignment) and gradual (free alignment, forced alignment). Higher coefficients indicate better results.</Paragraph>
      <Paragraph position="5"> The small flood of numbers in Table 1 may seem confusing. Therefore we calculated averages per factor, which are presented in Table 4. We invite the reader to refer to both Table 1 and Table 4 in following the discussion below. Table 4 shows systematic differences. First, contextually sensitive measures (bigrams, trigrams, and xbigrams) are usually better (and never worse) than unigram measures. Second, the differences among the different means of operationalizing context (bigrams, trigrams and xbigrams) seem unremarkable. Third, measures which are sensitive to linear order are slightly worse on average than those which are not (variants of DICE). (Footnote 5: When using the unnormalized versions of the 'DICE' family, the distance is just equal to the number of non-shared n-grams.)</Paragraph>
      <Paragraph position="6"> But when comparing the first column in Table 1 with the others, we see that the highest correlations (0.73) are found among the order-sensitive methods. Fourth, forcing alignment to respect vowel/consonant differences yields a modest improvement in scores. Fifth, we see no clear advantage of measurements which weight n-grams more sensitively over binary comparison methods which distinguish only same and different. Sixth, and most surprisingly, we can compare Table 1, which provides the correlations of edit distances which were not normalized for length, with Table 2, which provides the results of the measurements which were normalized. For some normalized measurements the Cronbach's α values are minimally higher (by 0.01). But comparison of the correlation coefficients shows that normalization never improves measurements, and often leads to a deterioration. In Table 4 averages for the normalized measurements are given.</Paragraph>
      <Paragraph position="7"> Table 3: Local incoherence values, based on travel distances, for the unnormalized string edit distance measurements between 15 Norwegian dialects. Columns: binary (no alignment, free alignment, forced alignment) and gradual (free alignment, forced alignment). The lower the local incoherence value, the better the measurement technique.</Paragraph>
      <Paragraph position="9"> Normalized measurements display the same systematic differences that unnormalized measurements show, except for the differences between methods which consider the order of segments and methods which do not.</Paragraph>
      <Paragraph position="10"> Measures which are sensitive to linear order are slightly better than those which are not (variants of DICE).</Paragraph>
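      <Paragraph>The order-insensitive 'DICE' family can be sketched as follows. The unnormalized distance is the number of non-shared n-grams, as stated in the text; the normalized variant shown here divides by the total number of n-grams in both strings (one minus the Dice coefficient), which is an assumption about the normalization, as are the function names and the absence of word-boundary padding.

```python
from collections import Counter


def ngrams(segments, n):
    """Multiset of the n-grams in a sequence of segments (no padding)."""
    return Counter(tuple(segments[i:i + n]) for i in range(len(segments) - n + 1))


def dice_distance(a, b, n=2, normalize=False):
    """Count the n-grams that the two strings do not share.

    normalize=True divides by the total n-gram count of both strings,
    i.e. 1 minus the Dice coefficient (an assumption).
    """
    x, y = ngrams(a, n), ngrams(b, n)
    non_shared = sum(((x - y) + (y - x)).values())
    if not normalize:
        return non_shared
    total = sum(x.values()) + sum(y.values())
    return non_shared / total if total else 0.0
```

For two hypothetical transcriptions "melk" and "milk", the bigrams "me", "el", "mi" and "il" are not shared while "lk" is, so the unnormalized bigram distance is 4 and the normalized one is 4/6.</Paragraph>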
    </Section>
    <Section position="2" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
5.2 Norwegian Geographic Sensitivity
</SectionTitle>
      <Paragraph position="0"> As we mentioned in Section 4.2, Norway is very rugged. Therefore we based our local incoherence values on travel distances rather than on geographic distances &quot;as the crow flies&quot;. We computed local incoherence values for both unnormalized and normalized string edit distance measurements. The comparison confirms the findings of Section 5.1: unnormalized methods always perform better than normalized ones. The unnormalized results are presented in Table 3.</Paragraph>
      <Paragraph position="1"> Recall that lower local incoherence values should reflect better measurement techniques.</Paragraph>
      <Paragraph position="2"> When we examine the table as a whole, we note again that the various techniques are not hugely different: they perform with similar degrees of success.</Paragraph>
      <Paragraph position="3"> In Table 4, we find average local incoherence values for the factors under investigation. We find first that contextually sensitive measures (bigrams, trigrams, and xbigrams) are again superior to unigram methods, and second, measures which are sensitive to linear order are superior to the DICE-like methods (unnormalized versions). Third, linguistically informed alignments, which respect the vowel/consonant distinction, perform better than uninformed (&quot;free&quot;) alignment (for the normalized versions). Fourth, the average values do not suggest any benefit to the gradual weighting of n-grams in comparison with the binary weighting.</Paragraph>
      <Paragraph position="4"> Most surprisingly, normalization again appears to have a deleterious effect on the probity of the measurements. We must stress again that these finer interpretations of the results require confirmation with a larger set of sites.</Paragraph>
    </Section>
    <Section position="3" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
5.3 German Geographic Sensitivity
</SectionTitle>
      <Paragraph position="0"> When checking the consistency of the German measurements we find Cronbach's α values of 0.95 and 0.96 for all methods without alignment or with 'free alignment' and for all unigram-based methods. The higher Cronbach's α levels for this data set reflect the fact that it is larger. We find lower α values of 0.83-0.85 for the methods with 'forced alignment'. This accords with the consistency results for the Norwegian measurements.</Paragraph>
      <Paragraph position="1"> When using bigrams, α is equal to 0.80 (binary, normalized), 0.51 (gradual, normalized), 0.74 (binary, unnormalized) and 0.45 (gradual, unnormalized). These low values are striking, and we found no explanation for them, but they suggest that we should not attach much significance to this combination of measurement properties. On average, the unnormalized α values are the same as the normalized α values.</Paragraph>
      <Paragraph position="2"> Since consistency values are higher than 0.70 (with one exception), we validated the methods by calculating the geographic local incoherence values. We would have preferred to use perceptions, but we have no such data in the German case.</Paragraph>
      <Paragraph position="3"> Since we found unnormalized string edit distance measurements superior to normalized ones in Sections 5.1 and 5.2, we focus in this section on the unnormalized methods. Unnormalized results are shown in Table 5.</Paragraph>
      <Paragraph position="4"> Recall that the lower the local incoherence value, the better the measurement technique. We include this table for the sake of completeness, but it is clear that the results do not jibe with the results obtained from the Norwegian data. Unigram-based processing appears to be superior, and context inferior; order-sensitive processing is inferior to order-insensitive processing, and linguistically informed (&quot;forced&quot;) alignment appears to offer no advantage.</Paragraph>
      <Paragraph position="5"> We leave the contrast between the Norwegian and German results as a puzzle to be addressed in future work, but it should be clear that we have rather more confidence in the Norwegian than in the German results. This is due on the one hand to the availability of behavioral data we can use to independently validate our computations, but also to the more stable set of values we see in the case of the Norwegian data. Exactly why the German data is so much more variable is also a question we must postpone to future work.</Paragraph>
      <Paragraph position="6"> Table 4: Average correlation coefficients and local incoherence values per factor for string edit distance measurements among 15 Norwegian dialects. Higher coefficients and lower local incoherence values indicate better results.</Paragraph>
      <Paragraph position="7"> Table 5: Local incoherence values, based on geographic distances, for the unnormalized string edit distance measurements among 186 German dialects. Columns: binary (no alignment, free alignment, forced alignment) and gradual (free alignment, forced alignment). The lower the local incoherence value, the better the measurement technique.</Paragraph>
    </Section>
  </Section>
</Paper>