File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0626_metho.xml

Size: 16,017 bytes

Last Modified: 2025-10-06 14:15:32

<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0626">
  <Title>Automatic Construction of Weighted String Similarity Measures</Title>
  <Section position="4" start_page="213" end_page="214" type="metho">
    <SectionTitle>
3 Basic Techniques
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="213" end_page="214" type="sub_section">
      <SectionTitle>
Dynamic Programming
</SectionTitle>
      <Paragraph position="0"> A common technique for computing the length of the longest common subsequence (LCS) of two given strings is to apply a dynamic programming algorithm (Stephen, 1992). [Footnote 2: LCSR scores were calculated for tokens containing at least one alphabetic character, and a threshold of 0.7 was used to filter the resulting list.]</Paragraph>
      <Paragraph position="1"> [Figure 1: Example of the dynamic programming matrix for two strings.] If n is the length of string x and m is the length of string y, a (0..n, 0..m) matrix L describes the array of correspondences for these two strings. The initial column and the initial row of this matrix are set to 0. Now, a character matching function m has to be defined. The following definition for m is used to calculate the length of the LCS: $m(x, y) = 1$ if $x = y$, and $m(x, y) = 0$ otherwise.</Paragraph>
      <Paragraph position="3"> Now, the matrix can be filled dynamically, starting at the origin and using the following constraint: $\forall i \le n\ \forall j \le m:\ l_{i,j} = \max(l_{i-1,j},\ l_{i,j-1},\ l_{i-1,j-1} + m(x_i, y_j))$. Finally, the last field in this matrix contains the length of the LCS of the given strings. Note that matching is defined for each element of the character alphabet, including special symbols and white space. Consider the example in figure 1.</Paragraph>
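      <Paragraph> The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation; the function name lcs_length and the optional match parameter are assumptions introduced here.

```python
def lcs_length(x, y, match=None):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    if match is None:
        # Standard LCS matching function: 1 for identical units, 0 otherwise.
        match = lambda a, b: 1 if a == b else 0
    n, m = len(x), len(y)
    # (n+1) x (m+1) table; the initial row and column stay 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + match(x[i - 1], y[j - 1]),
            )
    # The last field holds the final value.
    return table[n][m]


print(lcs_length("kritiska", "critical"))  # 5
```
</Paragraph>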
      <Paragraph position="4"> This algorithm can be modified by changing the character matching function. One possibility is to set priorities for specific matches by defining weights for the corresponding characters. The function m is then modified to m(x, y) = w(x) in all cases of x = y, where w(x) is a weight for the character x. Another possibility is to define a complete character matching function for all elements of the alphabet. That means each m(x, y) defines an independent matching value for the pair [x, y].</Paragraph>
      <Paragraph position="5"> After these modifications, the final value of the dynamic programming algorithm described above changes according to the new matching function. The result no longer gives the length of the LCS and is therefore regarded as the highest score of correspondence (HSC).</Paragraph>
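      <Paragraph> Both modifications can be sketched by passing a different matching function to the same table-filling routine. The helper names (highest_score, identity_weights, pair_table), the default weight of 1.0 for characters without an explicit weight, and the example scores are assumptions made for illustration only.

```python
def highest_score(x, y, m):
    """Highest score of correspondence (HSC) for x and y under matching function m."""
    table = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + m(x[i - 1], y[j - 1]),
            )
    return table[-1][-1]


def identity_weights(w, default=1.0):
    """m(a, b) = w(a) when a == b, else 0 (default weight is an assumption)."""
    return lambda a, b: w.get(a, default) if a == b else 0.0


def pair_table(scores):
    """Independent matching function: one value per pair, 0 for unlisted pairs."""
    return lambda a, b: scores.get((a, b), 0.0)


# Hypothetical scores, chosen only to show the mechanics.
m = pair_table({("k", "c"): 0.5, ("a", "a"): 1.0})
print(highest_score("ka", "ca", m))  # 1.5
```
</Paragraph>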
      <Paragraph position="6"> Furthermore, the string segmentation can be modified. The algorithm above does not require a segmentation into characters. Dynamic programming can be applied to string pairs which were split into larger units than single characters. The only requirement for this is an adequate definition of the string matching function for all possible pairs of string units.</Paragraph>
    </Section>
    <Section position="2" start_page="214" end_page="214" type="sub_section">
      <SectionTitle>
String Segmentation
</SectionTitle>
      <Paragraph position="0"> There is a common segmentation problem with units larger than one element. The problem arises in case of overlapping units within the string. A simple approach is to parse the string from left to right and to find the longest possible segment starting at the current position. The segmentation process starts again at the position directly after the last position of the previous segment. This approach was used for string segmentation in this study.</Paragraph>
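      <Paragraph> The left-to-right longest-match strategy described above might look as follows. This is a sketch under stated assumptions: the function name segment, the representation of the unit inventory as a set of strings, and the fallback to single characters when no known unit matches are not taken from the paper.

```python
def segment(text, units):
    """Greedy left-to-right segmentation into the longest known unit at each position."""
    longest = max([len(u) for u in units] + [1])
    segments = []
    pos = 0
    while len(text) > pos:
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Single characters are always accepted as a fallback (assumption).
            if size == 1 or candidate in units:
                segments.append(candidate)
                # Continue directly after the segment just found.
                pos += size
                break
    return segments


print(segment("kritiska", {"kr", "sk", "ska"}))  # ['kr', 'i', 't', 'i', 'ska']
```

Note that 'ska' is preferred over the overlapping unit 'sk' because the longest candidate is tried first.</Paragraph>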
    </Section>
    <Section position="3" start_page="214" end_page="214" type="sub_section">
      <SectionTitle>
Co-occurrence Statistics
</SectionTitle>
      <Paragraph position="0"> Co-occurrence can be measured by different statistical metrics. They can be estimated by frequency counts for the elements to be considered. The value of f(a) refers to the overall frequency counted for element a and the value of f(x, y) refers to the co-occurrence frequency of the elements x and y in the collection of N aligned units.</Paragraph>
      <Paragraph position="1"> The following formulas describe approximations of two commonly used metrics, Mutual Information (I) and the Dice coefficient (Dice) (Smadja et al., 1996; Church et al., 1991): $I(x, y) \approx \log_2 \frac{N \cdot f(x, y)}{f(x) \cdot f(y)}$ and $Dice(x, y) \approx \frac{2 \cdot f(x, y)}{f(x) + f(y)}$. The proposed approaches to the generation of matching functions are based on the calculation of co-occurrence statistics. String units have to be matched at certain positions in order to measure co-occurrence frequencies. A so-called estimated position can be used to determine the position of the corresponding string unit. The following formula returns this value for the string pair [x, y] and the i-th</Paragraph>
      <Paragraph position="3"> Case Folding and Alphabet Restrictions: Case folding can be used to neutralize capitalization at the beginning of sentences. This can be useful for investigations of string similarity. However, valuable information can be lost, especially when it comes to weighted matching functions. A higher priority for matching capitals would be desirable in cases of proper nouns. Furthermore, a reduced score might be useful when matching capitals with lower case characters. However, in this study case folding was applied.</Paragraph>
      <Paragraph position="4"> Furthermore, the alphabet of elements to be considered in the generation of the string matching function can be restricted. Results can be strongly influenced by wrong scores for special symbols and low-frequency elements. This phenomenon appears especially in the case of independent matching functions, where, for example, automatically generated m-functions may include matches for non-identical digits.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="214" end_page="217" type="metho">
    <SectionTitle>
4 Generating the String Matching Function
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="214" end_page="215" type="sub_section">
      <SectionTitle>
4.1 Approach 1: Map Characters (VCchar)
</SectionTitle>
      <Paragraph position="0"> The aim of this approach is to produce an independent matching function m based on a segmentation at the character level. The following heuristic is used: Pairs of vowels and consonants, respectively, which co-occur more frequently in the reference lexicon get a higher value in the m-function than lower frequency pairs.</Paragraph>
      <Paragraph position="1"> Pairs which do not co-occur at all get the function value 0.</Paragraph>
      <Paragraph position="2"> In this approach the matching function is generated in three steps. First, all vowels at similar estimated positions in word pairs from the reference lexicon are mapped to each other. Consonants are processed in a similar manner. Second, the frequencies of all elements in the alphabet are counted on both sides, and the frequency of each unique character mapping determines the co-occurrence frequency for each pair of characters. Finally, the Dice coefficient is used to calculate a value for each pair of characters in the list of character mappings. This value is used for the corresponding pair of characters in the final string matching function m. The Dice coefficient was chosen because it produces values between 0 and 1. In this way, the resulting similarity score also remains between 0 and 1, which is preferable.</Paragraph>
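      <Paragraph> Steps two and three can be sketched as follows, assuming the character mappings from step one are already available as a list of (source, target) pairs. The function name dice_scores and the toy data are hypothetical; the score uses the standard Dice coefficient, 2 f(x, y) / (f(x) + f(y)), and the mapping step based on estimated positions is not reproduced here.

```python
from collections import Counter


def dice_scores(word_pairs, mappings):
    """Dice score for every unique character mapping.

    word_pairs: list of (source_word, target_word) from the reference lexicon.
    mappings:   list of (source_char, target_char) produced by step one
                (assumed to contain only characters that occur in word_pairs).
    """
    # Overall frequencies, counted separately on the source and target side.
    src_freq = Counter(ch for src, _ in word_pairs for ch in src)
    tgt_freq = Counter(ch for _, tgt in word_pairs for ch in tgt)
    # Co-occurrence frequency = frequency of each unique mapping.
    pair_freq = Counter(mappings)
    return {
        (x, y): 2.0 * f_xy / (src_freq[x] + tgt_freq[y])
        for (x, y), f_xy in pair_freq.items()
    }


# Toy input: one lexicon entry and two hypothetical mappings.
scores = dice_scores([("korrekt", "correct")], [("k", "c"), ("o", "o")])
print(scores[("k", "c")])  # 0.5
```
</Paragraph>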
      <Paragraph position="3"> One problem arises with the definition of the sets of vowels and consonants, because the usage of letters can be context sensitive. For simplicity, a static, disjoint definition of both sets was chosen (e.g. 'y' has been used as a vowel only).</Paragraph>
      <Paragraph position="4"> [Figure 2: ... mappings, the first seven non-identical character mappings, and the first two mappings for each Swedish diacritic in the Swedish/English VCchar matching function.]</Paragraph>
      <Paragraph position="5"> The resulting list (sorted by descending Dice score) contains mainly pairs of identical letters at the top. Figure 2 shows, besides the seven highest-ranked pairs in the list, the first seven non-identical pairs and the mappings of Swedish diacritics, all obtained by applying the approach to the Swedish/English reference lexicon GordSVEN.</Paragraph>
      <Paragraph position="6"> There are mappings of non-identical characters which are hard to retrace, e.g. the relation between 'a' and 'i'. However, most of the highest-ranked non-identical pairs reflect interesting connections between different characters in Swedish/English word pairs. Relations between 'k' and 'c' ('korrekt' - 'correctly', 'kopia' - 'copy'), 'a' and 'e' ('beskriva' - 'describe', 'deformerad' - 'deformed'), and 'v' and 'w' ('vatten' - 'water', 'två' - 'two') can be recognized easily. Furthermore, the algorithm provides interesting weights for pairs of identical characters. The function shows that infrequent letters like '6' and 'x' can be matched with high confidence. In contrast, more frequent characters with larger inconsistency, like 'k', 'c', and 'w', obtain a lower value in this function; e.g. the match of the character 'c' in Swedish with the identical character 'c' in English is scored with only 0.1123 points.</Paragraph>
      <Paragraph position="7"> The automatically generated m-function for matching character pairs was applied for string similarity calculation to the list of Swedish/English candidates from parts of the PLUG corpus. The program returned 1,449 alignment candidates with an estimated precision of 96.8% when using a threshold of 0.35.</Paragraph>
    </Section>
    <Section position="2" start_page="215" end_page="216" type="sub_section">
      <SectionTitle>
4.2 Approach 2: Map Vowel and
Consonant Sequences (VCseq)
</SectionTitle>
      <Paragraph position="0"> The goal of this approach is to generate a function for matching pairs of vowel sequences and pairs of consonant sequences. The motivation is to extend the segmentation of strings from the character level to an n-gram model. As in approach 1, a reference lexicon is used to calculate co-occurrence statistics for pairs of elements from the alphabet of string units. However, the segmentation of strings is different. Each string from the lexicon is split into alternating vowel sequences and consonant sequences. Furthermore, these sequences may be interrupted by character sequences from the set of remaining elements in the alphabet (characters which are neither in the set of vowels nor in the set of consonants). Now, all vowel sequences and consonant sequences, respectively, at identical estimated positions are mapped to each other and the frequency of each unique mapping is counted. As in approach 1, Dice scores are estimated by using the overall frequencies of each character sequence and the frequencies of each pair in the list of mappings. Figure 3 shows some mappings from the application of this algorithm to the Swedish/English word list GordSVEN.</Paragraph>
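      <Paragraph> The segmentation into vowel and consonant sequences can be sketched with itertools.groupby. The concrete vowel and consonant sets below are assumptions (the text only states that the sets are static and disjoint and that 'y' is treated as a vowel), as are the names char_class and vc_segments.

```python
from itertools import groupby

# Assumed static, disjoint sets; 'y' is treated as a vowel only.
VOWELS = set("aeiouyåäö")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def char_class(ch):
    """Return 'V' for vowels, 'C' for consonants, 'O' for all other characters."""
    if ch in VOWELS:
        return "V"
    if ch in CONSONANTS:
        return "C"
    return "O"


def vc_segments(word):
    """Split a case-folded word into maximal runs of one character class."""
    word = word.lower()
    return [(cls, "".join(run)) for cls, run in groupby(word, key=char_class)]


print(vc_segments("anpassa"))
# [('V', 'a'), ('C', 'np'), ('V', 'a'), ('C', 'ss'), ('V', 'a')]
```
</Paragraph>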
      <Paragraph position="1"> Again, the pairs with the highest ranking are mainly identical strings. In contrast to the VCchar function, there are already two non-identical pairs in the top seven of the list. However, their co-occurrence frequencies are very low (4 and 2, respectively) and therefore the statistics are not very reliable. The value for the pair 'np' and 'dj' is due to four dictionary entries with morphological variants of ('anpassa','adjust') and ('anpassning','adjustment') and the low overall frequencies of 'np' and 'dj'. Similarly, the link between 'ktt' and 'bs' is due to three word pairs with variants of 'iakttagare' and 'observer' in the reference lexicon. A higher threshold for the co-occurrence frequency can be used to remove these pairs. However, many interesting links would be lost in this way as well.</Paragraph>
      <Paragraph position="2"> The mappings for Swedish diacritics are not very reliable as reflected in their scores. These values will not influence the similarity measurements a lot.</Paragraph>
      <Paragraph position="3"> The program returned 651 candidates when applied to word pairs from the PLUG corpus with a threshold of 0.15. The result yielded an estimated precision of 92.9%.</Paragraph>
      <Paragraph position="4"> [Figure 3: ... non-identical pairs, and the first two mappings for each Swedish diacritic in the Swedish/English VCseq matching function.]</Paragraph>
      <Paragraph position="5"> [Footnote: ... other string similarity metrics like LCSR because token pairs obtain a much lower score on average.]</Paragraph>
    </Section>
    <Section position="3" start_page="216" end_page="217" type="sub_section">
      <SectionTitle>
4.3 Approach 3: Map Non-Matching Parts (NMmap)
</SectionTitle>
      <Paragraph position="0"> The last approach discussed here differs from the other two in its general principle. In contrast to the other approaches, the goal of the third approach is to extend a common matching function with some additional values for specific pairs. The basic matching function is represented by the m-function for LCS calculation (see section 3). As in the other approaches, a reference lexicon is taken to generate matching values for some specific pairs. Dynamic programming and a best trace computation can be used to identify non-matching parts of two strings. Now, these parts can be analyzed in order to find language pair specific correspondences.</Paragraph>
      <Paragraph position="1"> A simple idea is to match corresponding parts from the lists of non-matching strings to each other if they do not exceed a certain length. In this study a length of three characters was chosen as a threshold. Consider figure 4 for an example of the mapping of non-matching parts for the Swedish/English word pair (kritiska,critical).</Paragraph>
      <Paragraph position="2"> Now, a weight for each pair of non-matching strings [x, y] can be calculated by dividing its frequency by the total number of non-matching mappings for the source string x. [Figure 4: Mapping of non-matching parts for the word pair (kritiska,critical). Source string: k r i t i sk a; target string: c r i t i c a l; non-matching pairs: 'k' -> 'c', ...]</Paragraph>
      <Paragraph position="3"> [Figure 5: ... pair mappings in the Swedish/English matching function with a frequency higher than four.] Figure 5 shows the seven non-matching pairs with the highest ranking and a frequency of more than four which were computed from the Swedish/English list of cognates ScaniaSVEN. The mappings reflect some typical differences in the writing of similar words in these two languages. The relation between 'ska' and 'c' can be seen in word pairs like (asymmetriska,asymmetric) and (automatiska,automatic). Correspondences between 'k' and 'c' are common in many Swedish/English pairs, e.g. in (korrekt,correct) and (funktion,function). The mapping of 'sk' to 'c' is similar to the mapping of 'ska' to 'c' but applies to indefinite singular forms of Swedish adjectives. The connection between 'ras' and 'd' can be found in passive voice constructions, e.g. in (rekommenderas,are recommended). The mapping of 'v' and 'w' is due to the fact that the letter 'w' does not exist in Swedish in practice. Finally, the change of vowels is quite common. The relation between 'e' and 'a' can be seen for example in pairs like (sida,side) and (lina,line). Furthermore, Swedish diacritics are represented by other characters in English. The mutation of '~i' to 'a' can be seen for example in the pair (m~k,mark), but as reflected in the corresponding matching value this is not as reliable as the match of e.g. 'k' and 'c'.</Paragraph>
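      <Paragraph> The extraction and weighting of non-matching parts can be sketched as follows. Note the substitution: instead of the LCS best-trace computation described above, this sketch uses difflib.SequenceMatcher from the Python standard library to find the aligned differing parts, so results may differ from the original procedure for some word pairs. The function names, the restriction to 'replace' blocks (insertions and deletions are ignored), and the toy lexicon are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

MAX_LEN = 3  # length threshold for non-matching parts, as stated in the text


def non_matching_pairs(source, target):
    """Pairs of differing substrings that are aligned between matching parts."""
    pairs = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only substitutions are kept; pure insertions/deletions are skipped.
        if tag != "replace":
            continue
        src_part, tgt_part = source[i1:i2], target[j1:j2]
        if MAX_LEN >= len(src_part) and MAX_LEN >= len(tgt_part):
            pairs.append((src_part, tgt_part))
    return pairs


def mapping_weights(lexicon):
    """Weight of [x, y] = frequency of the pair / all non-matching mappings of x."""
    pair_freq = Counter()
    source_freq = Counter()
    for source, target in lexicon:
        for src_part, tgt_part in non_matching_pairs(source, target):
            pair_freq[(src_part, tgt_part)] += 1
            source_freq[src_part] += 1
    return {pair: count / source_freq[pair[0]] for pair, count in pair_freq.items()}


print(non_matching_pairs("kritiska", "critical"))  # [('k', 'c'), ('sk', 'c')]
```
</Paragraph>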
      <Paragraph position="4"> The program returned 2,006 pairs with a score higher than 0.7 when applied to the Swedish/English word list obtained from parts of the PLUG corpus. This represents a gain of about 21% additional links compared to the number of pairs obtained by calculating the basic LCSR scores with the same threshold of 0.7. The estimated precision also improves, from 92.5% for LCSR extraction to 95.5% for the approach including non-matching scores.</Paragraph>
    </Section>
  </Section>
</Paper>