<?xml version="1.0" standalone="yes"?>
<Paper uid="A88-1028">
  <Title>COMPUTATIONAL TECHNIQUES FOR IMPROVED NAME SEARCH</Title>
  <Section position="4" start_page="203" end_page="203" type="metho">
    <SectionTitle>
2.0 CURRENT NAME SEARCH PROCEDURES
</SectionTitle>
    <Paragraph position="0"> In current name search procedures, a search request is reduced to a canonical form which is then matched against a database of names also reduced to their canonical equivalents. All names having the same canonical form as the query name will be retrieved. The intent is that similar names (e.g., Cole, Kohl, Koll) will have identical canonical forms and dissimilar names (e.g., Cole, Smith, Jones) will have different canonical forms. Retrieval should then be insensitive to simple transformations such as spelling variants. Techniques of this type have been reviewed by Moore et al. (1977).</Paragraph>
    <Paragraph position="1"> However, because of spelling variation in proper names, the canonical reduction algorithm may not always have the desired characteristics.</Paragraph>
    <Paragraph position="2"> Sometimes similar names are mapped to different canonical forms and dissimilar names mapped to the same forms. This is especially true when 'foreign' or non-European names are included in the database, because the canonical reduction techniques such as SOUNDEX and NYSIIS are very language-specific and based largely on Western European names. For example, one of the SOUNDEX reduction rules assumes that the characteristic shape of a name is embodied in its consonants and therefore the rule deletes most of the vowels. Although reasonable for English and certain other languages, this rule is less applicable to Chinese surnames which may be distinguished only by vowel (e.g., Li, Lee, Lu).</Paragraph>
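The vowel-deletion behavior described above can be seen in a minimal SOUNDEX-style sketch (simplified; the standard algorithm has extra rules for H/W and for letter prefixes, so this is illustrative, not the exact procedure used in the systems the paper reviews):

```python
def soundex(name):
    """Simplified SOUNDEX-style reduction: keep the first letter, map the
    remaining consonants to digit classes, drop vowels (and H, W, Y),
    collapse adjacent duplicate codes, and pad the result to four characters."""
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.upper()
    out = [name[0]]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")  # vowels, H, W, Y carry no code
        if digit and digit != prev:
            out.append(digit)
        prev = digit
    return "".join(out)[:4].ljust(4, "0")

# Spelling variants collapse to one canonical form:
print(soundex("Smith"), soundex("Smyth"))  # S530 S530
# ...but vowel deletion also collapses surnames distinguished only by vowel:
print(soundex("Li"), soundex("Lee"), soundex("Lu"))  # L000 L000 L000
```

The second example shows exactly the failure mode described in the text: Li, Lee, and Lu all reduce to the same canonical code once vowels are deleted.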
    <Paragraph position="3"> In large databases with diverse sources of names, other name conventions may also need to be handled, such as the use of both matronymic and patronymic in Spanish (e.g., Maria Hernandez Garcia) or the inverted order of Chinese names (e.g., Li-Fang-Kuei, where Li is the surname).</Paragraph>
  </Section>
  <Section position="5" start_page="203" end_page="206" type="metho">
    <SectionTitle>
3.0 LANGUAGE CLASSIFICATION
</SectionTitle>
    <Paragraph position="0"> As mentioned in section 1.0, the approach taken to improve existing name search techniques was to first classify the query name as to language source and then use language-specific rewrite rules to generate plausible name variants. A statistical classifier based on Hidden Markov Models (HMM) was developed for several reasons. Similar models have been used successfully in language identification based on phonetic strings (House and Neuburg 1977, Li and Edwards 1980) and text strings (Ferguson 1980).</Paragraph>
    <Paragraph position="1"> Also, HMMs have a relatively simple structure that makes them tractable, both analytically and computationally, and effective procedures already exist for deriving HMMs from a purely statistical analysis of representative text.</Paragraph>
    <Paragraph position="2"> HMMs are useful in language classification because they provide a means of assigning a probability distribution to words or names in a specific language. In particular, given an HMM, the probability that a given word would be generated by that model can be computed. Therefore, the decision procedure used in this project is to compute that probability for a given name against each of the language models, and to select as the source language that language whose model is most likely to generate the name.</Paragraph>
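The decision procedure can be sketched with the forward algorithm and toy one-state "models" (the models below are hypothetical; the paper's classifier used trained multi-state HMMs per language):

```python
def forward_prob(obs, pi, A, B):
    """P(obs | HMM) via the forward algorithm.
    pi[s]: initial-state probability; A[s][t]: transition probability;
    B[s]: output distribution (symbol -> probability) for state s."""
    n = len(pi)
    alpha = [pi[s] * B[s].get(obs[0], 0.0) for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t].get(o, 0.0)
                 for t in range(n)]
    return sum(alpha)

def classify(name, models):
    """Select the language whose model is most likely to generate the name."""
    return max(models, key=lambda lang: forward_prob(name, *models[lang]))

# Hypothetical one-state models over a two-letter alphabet:
models = {
    "lang1": ([1.0], [[1.0]], [{"a": 0.5, "b": 0.5}]),
    "lang2": ([1.0], [[1.0]], [{"a": 0.1, "b": 0.9}]),
}
print(classify("aab", models))  # lang1
print(classify("bbb", models))  # lang2
```

"aab" scores 0.125 under lang1 but only 0.009 under lang2, so lang1 is selected; "bbb" goes the other way.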
    <Section position="1" start_page="203" end_page="206" type="sub_section">
      <SectionTitle>
3.1 EXAMPLE OF HMM MODELING TEXT
</SectionTitle>
      <Paragraph position="0"> The following example illustrates how HMMs can be used to capture important information about language data. Table 1 contains training data representing sample text strings in a language corpus. Three different HMMs, of two, four and six states, were built from these data and are shown in Tables 2-4, respectively. (The symbol CR in the tables corresponds to the blank space between words and is used as a word delimiter.) These HMMs can also be represented graphically, as shown in Figures 1-3. The numbered circles correspond to states; the arrows represent state transitions with non-zero probability and are labeled with the transition probability. The boxes contain the probability distribution of the output symbols produced when the model is in the state to which the box is connected. The process of generating the output sequence of a model can then be seen as a random traversal of the graph according to the probability weights on the arrows, with an output symbol generated randomly each time a state is visited, according to the output distribution associated with that state.</Paragraph>
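The "random traversal" view of generation can be sketched directly (a minimal sketch; the model parameters passed in are hypothetical, not those of Tables 2-4):

```python
import random

def generate(pi, A, B, steps, seed=0):
    """Randomly traverse an HMM graph: pick a start state from pi, then at
    each visit emit one symbol from that state's output distribution B[state]
    and move along a transition sampled from the weights in A[state]."""
    rng = random.Random(seed)
    state = rng.choices(range(len(pi)), weights=pi)[0]
    out = []
    for _ in range(steps):
        symbols = list(B[state])
        out.append(rng.choices(symbols, weights=[B[state][s] for s in symbols])[0])
        state = rng.choices(range(len(A)), weights=A[state])[0]
    return "".join(out)

# A degenerate one-state model emits its single symbol deterministically:
print(generate([1.0], [[1.0]], [{"a": 1.0}], 4))  # aaaa
```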
      <Paragraph position="1"> For example, in the two-state model shown in Table 2 (and graphically in Figure 1), letter (non-delimiter) symbols can be produced only in state two, and the output probability distribution for this state is simply the relative frequency with which each letter appears in the training data. That is, in the training data in Table 1 there are 15 letter symbols: five &amp;quot;a&amp;quot;, four &amp;quot;b&amp;quot;, three &amp;quot;c&amp;quot;, etc., and the model assigns a probability of 5/15 = 0.333 to &amp;quot;a&amp;quot;, 4/15 = 0.267 to &amp;quot;b&amp;quot;, and so on. Similarly, the state transition probabilities for state two reflect the relative frequency with which letters follow letters and word delimiters follow letters. These parameters are derived strictly from an iterative automatic procedure and do not reflect human analysis of the data.</Paragraph>
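The relative-frequency computation can be reproduced directly; the training strings below are hypothetical, chosen only to match the quoted counts (five "a", four "b", three "c", plus three other letters, 15 letter symbols in all), since Table 1 itself is not reproduced here:

```python
from collections import Counter

def letter_output_probs(words):
    """Output distribution for the letter state of the two-state model:
    the relative frequency of each letter across the training words."""
    counts = Counter(ch for w in words for ch in w)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical training data matching the quoted letter counts:
probs = letter_output_probs(["abc", "abc", "abc", "aabd", "ef"])
print(round(probs["a"], 3), round(probs["b"], 3))  # 0.333 0.267
```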
      <Paragraph position="6"> In the four state model shown in Table 3 (and Figure 2), it is possible to model the training data with more detail, and the iterations converge to a model with the two most frequently occurring symbols, &amp;quot;a&amp;quot; and &amp;quot;b&amp;quot;, assigned to unique states (states two and four, respectively) and the remaining letters aggregated in state three. State one contains the word delimiter and transitions from state one occur only to state two, reflecting the fact that &amp;quot;a&amp;quot; is always word-initial in the training data.</Paragraph>
      <Paragraph position="7"> In the six state model shown in Table 4 (and Figure 3), the training data is modeled exactly. Each state corresponds to exactly one output symbol (a letter or word delimiter). For each state, transitions occur only to the state corresponding to the next allowable letter or to the word delimiter.</Paragraph>
      <Paragraph position="8"> The outputs generated by these three models are shown in Table 5. The six state model can be used to model the training data exactly, and in general, the faithfulness with which the training data are represented increases with the number of states.</Paragraph>
    </Section>
    <Section position="2" start_page="206" end_page="206" type="sub_section">
      <SectionTitle>
3.2 HMM MODEL OF SPANISH NAMES
</SectionTitle>
      <Paragraph position="0"> The simple example in the preceding section illustrates the connection between model parameters and training data. It is more difficult to interpret models derived from more complex data such as natural language text, but it is possible to provide intuitive interpretations to the states in such models. Table 6 shows an eight state HMM derived from Spanish surnames. State transition probabilities are shown at the bottom of the table, and it can be seen that the transition probability from state eight to state one (word delimiter) is greater than .95. That is, state eight can be considered to represent a &amp;quot;word final&amp;quot; state. The top part of the table shows that the highest output probabilities for state eight are assigned to the letters &amp;quot;a,o,s,z&amp;quot;, correctly reflecting the fact that these letters commonly occur word-finally in Spanish (e.g., Garcia, Murillo, Fuentes, Diaz). This HMM also &amp;quot;discovers&amp;quot; linguistic categories, such as the class of non-word-final vowels represented by state seven with the highest output probabilities assigned to the vowels &amp;quot;a,e,i,o,u&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="206" end_page="206" type="sub_section">
      <SectionTitle>
3.3 LANGUAGE CLASSIFICATION
</SectionTitle>
      <Paragraph position="0"> In order to use HMMs for language classification, it was first necessary to construct a model for each language category based on a representative sample. A maximum likelihood (ML) estimation technique was used because it leads to a relatively simple method for iteratively generating a sequence of successively better models for a given set of words. HMMs of four, six and eight states were generated for each of the language categories, and an eight state HMM was selected for the final configuration of the classifier. Higher dimensional models were not evaluated because the eight state model performed well enough for the application.</Paragraph>
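The iterative ML re-estimation can be sketched for a single training sequence (plain Baum-Welch without the numerical scaling needed for long sequences; the toy parameters below are hypothetical):

```python
def forward(obs, pi, A, B):
    """Forward pass: alphas[t][s] = P(obs[:t+1], state at t = s)."""
    n = len(pi)
    alphas = [[pi[s] * B[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append([sum(prev[s] * A[s][t] for s in range(n)) * B[t][o]
                       for t in range(n)])
    return alphas

def backward(obs, A, B):
    """Backward pass: betas[t][s] = P(obs[t+1:] | state at t = s)."""
    n = len(A)
    betas = [[1.0] * n]
    for i in range(len(obs) - 1, 0, -1):
        nxt = betas[0]
        betas.insert(0, [sum(A[s][t] * B[t][obs[i]] * nxt[t] for t in range(n))
                         for s in range(n)])
    return betas

def likelihood(obs, pi, A, B):
    """P(obs | model)."""
    return sum(forward(obs, pi, A, B)[-1])

def baum_welch_step(obs, pi, A, B):
    """One maximum-likelihood (EM) re-estimation step; a step never
    decreases the likelihood of the training sequence."""
    n, T = len(pi), len(obs)
    al, be = forward(obs, pi, A, B), backward(obs, A, B)
    Z = sum(al[-1])
    gamma = [[al[t][s] * be[t][s] / Z for s in range(n)] for t in range(T)]
    xi = [[[al[t][s] * A[s][u] * B[u][obs[t + 1]] * be[t + 1][u] / Z
            for u in range(n)] for s in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][s][u] for t in range(T - 1)) /
              sum(gamma[t][s] for t in range(T - 1))
              for u in range(n)] for s in range(n)]
    new_B = [{o: sum(gamma[t][s] for t in range(T) if obs[t] == o) /
                 sum(gamma[t][s] for t in range(T))
              for o in B[s]} for s in range(n)]
    return new_pi, new_A, new_B

# Hypothetical toy model and training sequence:
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [{"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.8}]
before = likelihood("abba", pi, A, B)
pi, A, B = baum_welch_step("abba", pi, A, B)
after = likelihood("abba", pi, A, B)
# after >= before: each iteration yields a successively better model
```

Iterating `baum_welch_step` produces the "sequence of successively better models" described above; the number of states is fixed in advance, which is how the four, six and eight state candidates were obtained.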
      <Paragraph position="1"> With combined training and test data, language classification accuracy was 98% for Vietnamese, 96% for Farsi, 91% for Spanish, and 88% for Other. With training data separate from test data, language classification accuracy was 96% for Vietnamese, 90% for Farsi, 89% for Spanish, and 87% for Other. The language classification results are shown in Tables 7 and 8.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="206" end_page="209" type="metho">
    <SectionTitle>
4.0 LINGUISTIC RULE COMPONENT
</SectionTitle>
    <Paragraph position="0"> For each of the three language groups, Vietnamese, Farsi and Spanish, a set of linguistic rules could be applied using a general rule interpreter. The rules were developed after studying naming conventions and common transcription variations and also after performing protocol analyses to see how native English speakers (mis)spelled names pronounced by native Vietnamese (and Farsi and Spanish) speakers and (mis)pronounced by other English speakers. Naming conventions included word order (e.g., surnames coming first, or parents' surnames both used); common transcription variations included Romanization issues (e.g., a Farsi character that is written as either 'v' or 'w').</Paragraph>
    <Paragraph position="1"> The general form of the rules is lhs --&gt; rhs / leftContext _ rightContext, where the left-hand-side (lhs) is a character string and the right-hand-side (rhs) is a string with a possible weight, so that the rules could be associated with a plausibility factor.</Paragraph>
    <Paragraph position="2"> Sample rules include the following. C goes to K when it precedes A. J goes to J, H or G when it is word initial (J-&gt;J|H|G/#_). Y goes to Y or I when it is not word initial (Y-&gt;Y|I). F goes to F or V when it is not word final (F-&gt;F|V). C goes to C or S when it precedes E or I. H goes to H or J when it follows a letter other than C or S, and precedes A, E, I, O, or U. T goes to T or D when it is word initial and precedes a letter other than R. IE goes to IE, I or Y when it is word final. O goes to O, E or U when it follows S and precedes final N.</Paragraph>
    <Paragraph position="3"> Rules may include a specific context; if a specific environment is not described, the rule applies in all cases. Table 9 shows sample rules and examples of output strings generated by applying the rules. The 'N/A' column gives examples of name strings for which a rule does not apply because the specified context is absent. An example with plausibility weights is also shown.</Paragraph>
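A general rule interpreter of this kind can be sketched as follows (the regex encoding of rules is an assumption, not the paper's actual rule syntax, and the example names are hypothetical):

```python
import re

def expand(name, rules):
    """Generate plausible variants by applying context-sensitive rewrite
    rules. Each rule is (pattern, alternatives): the regex matches the
    left-hand side in its required context (lookarounds and anchors encode
    the environment; '^' plays the role of the word-initial boundary '#'),
    and every alternative replacement yields a new variant. A rule whose
    context is absent from the name simply does not fire (the 'N/A' case)."""
    variants = {name}
    for pattern, alts in rules:
        for v in list(variants):
            for m in re.finditer(pattern, v):
                for alt in alts:
                    variants.add(v[:m.start()] + alt + v[m.end():])
    return sorted(variants)

# Two rules from the text: word-initial J -> J, H or G; C -> K before A.
rules = [(r"^J", ["H", "G"]), (r"C(?=A)", ["K"])]
print(expand("JUAREZ", rules))  # ['GUAREZ', 'HUAREZ', 'JUAREZ']
print(expand("CARLOS", rules))  # ['CARLOS', 'KARLOS']
print(expand("PEREZ", rules))   # ['PEREZ']  (no context matches: N/A)
```

Plausibility weights could be carried alongside each alternative and multiplied as rules compose, but that bookkeeping is omitted here for brevity.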
  </Section>
</Paper>