<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1214">
  <Title>CHOOSING A DISTANCE METRIC FOR AUTOMATIC WORD CATEGORIZATION</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Word Categorization
</SectionTitle>
    <Paragraph position="0"> Zipf, (Zipf, 1935), who is a linguist, was one of the early researchers in statistical language models. His work states that 66% of large English corpus will fall within the first 2,000 most frequent words. Therefore, the number of distinct structures needed to find an approximation to a large proportion of natural language would be small compared to the size of corpus that could be used. It can be claimed that by working on a small set consisting of frequent words, it is possible to build a framework for the whole natural language.</Paragraph>
    <Paragraph position="1"> N-gram models of language are commonly used to build up such a framework. An N-gram model can be formed by collecting the probabilities of word streams (wiIi = 1..n) where wi is followed by wi+l.</Paragraph>
    <Paragraph position="2"> These probabilities will be used to form the model where we can predict the behavior of the language up to n words. There exists current research that uses bigram statistics for word categorization in which probabilities of word pairs in the text are collected and processed.</Paragraph>
    <Paragraph position="3"> These n-gram models can be used together with the concept of mutual information to form the clusters. Mutua//nformation is based on the concept of entropy which can be defined informally as the unpredictability of a stochastic experiment. For linguistic categorization, mutual information calculated would denote the amount of knowledge preserved in the bigram statistics. The detailed explanation of mutual information and adapting the formulations for automatic word categorization process could be found in (Lankhorst, 1994).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Clustering Approach
</SectionTitle>
      <Paragraph position="0"> When the mutual information is used for clustering, the process is carried out somewhat at a macrolevel. Usually search techniques and tools are used together with the mutual information in order to form some combinations of different sets, each of which is then subject to some validity test. The idea used for the validity testing process is as follows.</Paragraph>
      <Paragraph position="1"> Since the mutual information denotes the amount of probabilistic knowledge that a word provides on the proceeding word, if similar behaving words would be collected together into the same cluster, then the loss of mutual information would be minimal. So, the search is among possible alternatives for sets or clusters with the aim to obtain a minimal loss in mutual information.</Paragraph>
      <Paragraph position="2"> Though this top-to-bottom method seems theoretically possible, in the presented work (Korkmaz~Uqoluk, 1997) a different approach, which is bottom-up, is used. In this incremental approach, set prototypes axe built and then combined with other sets or single words to form larger ones. The method is based on the similarities or differences between single words rather than the mutual information of a whole corpus. In combining words into sets a fuzzy set approach is used.</Paragraph>
      <Paragraph position="3"> Using this constructive approach, it is possible to visualize the word clustering problem as the problem of clustering points in an n-dimensional space if the lexicon space to be clustered consists of n words.</Paragraph>
      <Paragraph position="4"> The points which are the words of the corpus are positioned on this n-dimensional space according to their behavior relative to other words in the lexicon space. Each word is placed on the i th dimension according to its bigram statistic with the word representing the dimension namely wi. So the degree of similarity between two words can be defined as having close bigram statistics in the corpus. Words are distributed in the n-dimensional space according to those bigram statistics. The idea is quite simple: Let wl and w2 be two words from the corpus. Let Z be the stochastic variable ranging over the words to be clustered. Then if Px(wl, Z) is close to Px(w~, Z) and if Px(Z, wl) is close to Px(Z, w2) for Z rang-Lug over all the words to be clustered in the corpus, then we can state a closeness between the words Wl and w2. Here Px is the probability of occurrences of word pairs. Px (wl, Z) is the probability where ~ wl appears as the first element in a word pair and Px(Z, wl) is the reverse probability where wl is the second element of the word pair. This is the same for w2 respectively.</Paragraph>
      <Paragraph position="5"> In order to start the clustering process, a distance function has to be defined between the elements in Korkmaz and G6ktark (JC/oluk 112 Choosing A Distance Metric for Word Categorization</Paragraph>
      <Paragraph position="7"> the space. Assume that the bigram statistics for word couples are placed in a matrix N, where N~j denotes the number of times word-couple (w~, wj) is observed in the corpus. So formulating the similarity between two linguistic elements would be finding out the distance between two vectors that can be obtained from this matrix. Different distance metrics are proposed for the distance between vectors.</Paragraph>
      <Paragraph position="8"> The usage of a distance metric forms the main discussion point of this paper. In next section first the algorithm used for categorization will be presented and in section 4 these metrics and their usage for linguistic categorization will be discussed.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The Algorithm for Categorization
</SectionTitle>
      <Paragraph position="0"> Having a distance function, it is possible to start the clustering process. The first idea that can be used is to form a greedy algorithm to start forming the hierarchy of word clusters. If the lexicon space to be clustered consists of {wl,w2,...,wn}, then the first element from the lexicon space w~ is taken and a cluster with this word and its nearest neighbor or neighbors is formed. Then the lexicon space is {(wl, ws~, ..., w~), wi, ..., w,} where (wl, ws~, ..., ws~) is the first cluster formed. The process is repeated with the first element in the list which does not belong to any set yet (wi for our case) and the process iterates until no such word is left. The sets formed will be the clusters at the bottom of the cluster hierarchy. Then to determine the behavior of a set, the frequencies of its elements axe added and the previous process this time is carried on the sets rather than on single words until the cluster hierarchy is formed, so the algorithm stops when a single set is formed that contains all the words in.</Paragraph>
      <Paragraph position="1"> the lexicon space.</Paragraph>
      <Paragraph position="2"> In the early stages of this research such a greedy method was used to form the clusters. However, though some clusters at the low levels of the tree seemed to be correctly formed, as the number of elements in a cluster increased towards the higher levels, the clustering results became unsatisfactory.</Paragraph>
      <Paragraph position="3"> Two main factors were observed as the reasons for the unsatisfactory results.</Paragraph>
      <Paragraph position="4"> These were: * Shortcomings of the greedy type algorithm.</Paragraph>
      <Paragraph position="5"> * inadequacy of the method used to obtain the set behavior from the properties of its elements.</Paragraph>
      <Paragraph position="6"> The greedy method results in a non optimal clustering in the initial level. To make this point clearer consider the following example: Let us assume that four words wl,w2, w3 and w4 axe forming the lexicon</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Figure 1
</SectionTitle>
    <Paragraph position="0"> gorithm in a lexicon space with four different words. Note that d~2.~ s is the smallest distance in the distribution. However since wl is taken into consideration, it forms setl with its nearest neighbor w2 and w3 combines with w4 and form set2, although w2 is nearer. And the expected third set is not formed.</Paragraph>
    <Paragraph position="1"> space. Furthermore, let the distances between these words be defined as dw~,wj. Then consider the distribution in Figure 1. If the greedy method first tries to cluster Wl, then it will be clustered with w2, since the smallest dwl,w, value is d~l,~ 2. So the second word will be captured in the set and the algorithm will continue the clustering process with w3. At this point, though w3 is closest to w2, it is captured in a set and since w3 is closer to w4 than the center of this set is, a new cluster will be formed with members w3 and w4. However, as it can be obviously seen visually from Figure 1 the first optimal cluster to be formed between these four words is the set The second problem causing unsatisfactory clustering occurs after the initial sets axe formed. According to the algorithm, the clusters behave exactly like other single words and participate in the clustering just as single words do. However to continue the process, the bigram statistics of the clusters should be determined. This means that the distance between the cluster and all the other elements in the search space have to be calculated. One easy way to determine this behavior is to find the average of the statistics of all the elements in a cluster. This method has its drawbacks. If the corpus used for the process is not large, the proximity problem becomes severe. On the other hand the linguistic role of a word may vary in contexts in different sentences.</Paragraph>
    <Paragraph position="2"> Many words axe used as noun , adjective or falling intosome other linguistic category depending on the context. It can be claimed that each word initially shall be placed in a cluster according to its dominant role. However to determine the behavior of a set the dominant roles of its elements should also be used.</Paragraph>
    <Paragraph position="3"> Somehow the common properties (bigrams) of the elements should be always used and the deviations of each element should be eliminated in the process.</Paragraph>
    <Paragraph position="4"> Korkmaz and G6kt~rk O~oluk 113 Choosing,4 Distance Metn'c for Word Categorization  The clustering process is improved to overcome the above mentioned drawbacks. To overcome the first problem the idea used is to allow words to be members of more than one cluster. So after the first pass over the lexicon space, intersecting clusters are formed. For the lexicon space presented in Figure 1 with four words, the expected third set will be also .formed. As the second step these intersecting sets are combined into a single set. Then the closest two words in each combined set (according to the distance function) are found and these two closest words are taken into consideration as the centroid for that set. After finding the centroids of all sets, the distances between a member and all the centroids are calculated for all the words in the lexicon space. Following this, each word is moved to the set where the distance between this member and the set center is minimal. This procedure is necessary since the initial sets are formed by combining the intersecting sets. When these intersecting sets are combined the set center of the resulting set might be far away from some elements and there may be other closer set centers formed by other combinations, so a reorganization of membership is appropriate.</Paragraph>
    <Paragraph position="5">  As presented in the previous section the clustering process builds up a cluster hierarchy. In the first step, words are combined to form the initial clusters, then those clusters become members of the process themselves. To combine dusters into new ones their statistical behavior should be determined. The statistical behavior of a cluster is related to the bigrams of its members. In order to find out the dominant statistical role of each cluster the notion of fuzzy membership is used.</Paragraph>
    <Paragraph position="6"> The problem that each word can belong to more than one linguistic category brings up the idea that the sets of word clusters cannot have crisp border lines and even if a word seems to be in a set due to its dominant linguistic role in the corpus, it can have a degree of membership to the other clusters in the search space. Therefore the concept of fuzzy membership can be used for determining the bigram statistics of a cluster.</Paragraph>
    <Paragraph position="7"> Researchers working on fuzzy clustering present a framework for defining fuzzy membership of elements. Gath and Geva (Gath, 1989) describe such an unsupervised optimal fuzzy clustering. They present the K-means algorithm based on minimization of an objective function. For the purpose of this research only the membership function of the algorithm presented is used. The membership function uij that is the degree of membership of the i th element to the jth cluster is defined as:</Paragraph>
    <Paragraph position="9"> Here Xi denotes an element in the search space, Vj is the centroid of the jth cluster. K denotes the number of clusters. And d2(Xi, Vj) is the distance of Xith element to the centroid Vj of the jth cluster.</Paragraph>
    <Paragraph position="10"> The parameter q is the weighting exponent for uij and controls the fuzziness of the resulting cluster.</Paragraph>
    <Paragraph position="11"> After the degrees of membership of all the elements of all classes in the search space are calculated, the bigram statistics of the classes are derived. To find those statistics the following method is used: For each subject cluster, the bigram statistics of each element is multiplied with its membership value. This forms the amount of statistical knowledge passed from the element to that set. So the elements chosen as set centroids will be the ones that affect a set's statistical behavior tile most.</Paragraph>
    <Paragraph position="12"> Hepce an element away from a centroid will have a lesser statistical contribution.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="111" type="metho">
    <SectionTitle>
4 Distance Metrics
</SectionTitle>
    <Paragraph position="0"> Various distance metrics have been proposed by mathematicians that can be used to formulate the similarity between vectors. Four of them are exmined and used for this study. The first one is the Manhattan Metric which just calculates the absolute difference between the values of two vector elements.</Paragraph>
    <Paragraph position="1"> It is defined by:</Paragraph>
    <Paragraph position="3"> are two vectors defined over 7~ n.</Paragraph>
    <Paragraph position="4"> Having such a metric it is possible to define the distance function between two linguistic elements. The distance function D between two words wl and</Paragraph>
    <Paragraph position="6"> Here the distance function consists of two different parts D1 and D2. This is because we want the distance function to be based on both proceeding and preceding words. So the first part denotes the distance on proceeding words and the second one denotes the distance obviously on the preceding words.</Paragraph>
    <Paragraph position="7"> If we use the Manhattan metric, the distance function would be :</Paragraph>
    <Paragraph position="9"> Here n is the total number of words to be clustered, gwai is the number of times word couple (wt, wi) is observed in the corpus and Niwa is the number of times word couple (wi, wl) is observed.</Paragraph>
    <Paragraph position="10"> Obviously it is the same for word w2. This distahoe metric just calculates the total difference on two vector-couples obtained from the frequency matrix N, where the first couple denotes the vectors obtained by the frequencies of the word-couples formed by wl, w2 and their proceeding words. The second couple denotes the vectors formed by the frequencies with the preceding words correspondingly.</Paragraph>
    <Paragraph position="11"> The above formulation explains the structure of the distance metric used for the study. For the researched presented in our previous paper (Korkmaz&amp;0~oluk, 1997) Manhattan Metric was the only metric used for the distance function. However others axe proposed for the similarity between vectors.</Paragraph>
    <Paragraph position="12"> Another metric is the Euclidean Metric:</Paragraph>
    <Paragraph position="14"> Here x and y axe again two vectors defined over 7~ n. Also the formulation of the angle between two vectors is also used for this study as a distance metric. If 0 is the angle between the two vectors x and y, then cos 0 is calculated by: x'y El &lt;i&lt;n xiyi COS</Paragraph>
    <Paragraph position="16"> Here, x'y denote the scalar product of the two vectors x and y and I x \] denote the magnitude of the vector x. Since the components of the vectors in our case are corresponding to the frequencies of words, they will be non-negative. So the angle between the two vectors will be between 0 deg and 90 deg. Since cos 0 deg is unity and cos 90 deg is zero, a distance metric between the two vectors can be defined as:</Paragraph>
    <Paragraph position="18"> This distance metric will give us a number from the closed interval \[0, 1\], 0 denoting that the two vectors are overlapping and 1 denoting that there is an angle of 90 deg which is the highest difference between the vectors.</Paragraph>
    <Paragraph position="19"> The last distance metric used for the similarity function is the Spearman Rank Correlation Coefficient. This metric is based on the difference between the ranks of two vectors rather than the difference between their elements. The metric is defined as:</Paragraph>
    <Paragraph position="21"> Here x and y axe again two vectors as defined above. Ri z nd R/~ are the ranks of the corresponding vectors. The rank is calculated for our case by normalizing the vectors in the interval \[0,1\]. The component with the highest value among the components of the vector takes the value 1 and if there axe n elements in the vector, the one with the second highest value will correspond to the number 1-(l/n) and so on. The smallest value will correspond to zero.</Paragraph>
    <Paragraph position="22"> For the process of formulating the distance between linguistic elements, the main problem appears due to the difference between the frequencies of words from the same linguistic category. For instance the word go has a very high frequency in natural language corpora compared to many other verbs, but still we have to cluster go with low frequency verbs. However if we use a distance metric based on only the absolute differences of vectors like the Euclidean Metric or Manhattan Metric, the distance calculated between high frequency and low frequency words would be high, which is undesired.</Paragraph>
    <Paragraph position="23"> Therefore when comparing a high frequency word with a low frequency one, we should be able to determine if the difference is caused by some regular magnitude difference. A similarity can exist between th e corresponding values when this magnitude difference is discarded. Without having a distance function that compensates for this, it is not possible to overcome the errors introduced by having different frequencies for words from the same linguistic category. This acts as a considerable factor disturbing the quality of formed clusters.</Paragraph>
    <Paragraph position="24"> Having this in mind the Spearman Rank Correlation Coefficient Metric and the Angle Metric are used as distance function. These two seem to discard the magnitude difference between the components of the vectors. Such a comparison seems to be more suitable for evaluating the similarity of linguistic elements.</Paragraph>
    <Paragraph position="25"> In the Spearman Rank Correlation Coefficient the vectors are normalized into the closed interval \[0,1\]. So the vectors are similar if the change from one component to the next is similar, regardless of the difference in the absolute values. We have a similar comparison for the Angle Metric. When this metric Korkmaz and GOlaiirk @oluk 115 Choosing A Distance Metric for Word Categorization</Paragraph>
    <Paragraph position="27"> disappeared. We were able to get an initial success rate of about 90% with the Manhattan Metric when we discarded this large faulty cluster. However with the other metrics this success rate has been obtained for all the lexicon space.</Paragraph>
    <Paragraph position="28"> The second problem encountered for the categorization process appears while combining the initial clusters into larger ones* Although it is possible to obtain some local successful combinations with the first metric, the overall performance in combining these initial clusters is not so satisfactory. So different metrics presented in section 4 have been tested on the algorithm. Unfortunately, although the proposed metrics were able to overcome the first problem of having a large faulty cluster, the progress obtained in combining initial clusters into larger ones was not so significant. This has been the factor triggering the idea that a metric taking into consideration both of the approaches for linguistic similarity would be more suitable for our case. So the fifth metric, the Combined Metric, has been constructed.</Paragraph>
    <Paragraph position="29"> The main progress obtained with this fifth metric is on the second problem described.</Paragraph>
    <Paragraph position="30"> In table 2 the hierarchies obtained using different metrics are presented. When the properties presented in this table are examined, the hierarchy formed by the Manhattan Metric has the minimum number of initial clusters. This is due to the large faulty cluster formed with this metric. The properties of the hierarchies presented in table 2 seem to be similar to each other. Only the depth of the tree formed with the Angle Metric differs from the other ones. This is because more initial clusters are combined on the second level in the hierarchy obtained with this metric. This brings in an increase in the number of ill-structured clusters on the second level over-combining distinct linguistic categories.</Paragraph>
    <Section position="1" start_page="0" end_page="111" type="sub_section">
      <SectionTitle>
5.1 Empirical Comparison
</SectionTitle>
      <Paragraph position="0"> The main progress for the clustering hierarchy is obtained by the Combined Metric. It seems suitable to examine this metric in detail and compare the results with the initial organization obtained by the Manhattan Metric.</Paragraph>
      <Paragraph position="1"> Some linguistic categories inferred by the algorithm using the Combined Metric are listed below: * professor opposite church hall least present once last baby prisoner doctor wind gate village sun country  some not such its The ill-placed members in the clusters above are shown using bold font. The above initial clusters represent the linguistic categories with a success rate of 90.2%. Also the plural nouns in singular noun clusters are shown in italics. If we consider those placements as faulty ones also, the calculated success rate would fall to 88.1%. This success rate seems to be similar to the results obtained with other distance metrics. However as explained above the main progress obtained with this Combined Metric is on the process of combining these initial clusters into larger ones in the upper levels of the duster hierarchy. null Two examples from the cluster hierarchy obtained with this metric are given in tables 4 and 5. In table 4 94 nouns coming from different initial clusters are combined in the same part of the cluster hierarchy. Only one cluster seems to be misplaced in this region. This is an adjective cluster. In table 5 67 different verbs are collected. They are all present tense verbs and no misplaced word exists in tiffs part of the hierarchy. This is another well-formed part of the cluster organization. It is believed that this is an important improvement compared to earlier results, since there is an increase in the number of successfully connected initial clusters.</Paragraph>
      <Paragraph position="2"> Table 3 exhibits the improvement obtained using the Combined Metric. Maximum number of words correctly classified for some linguistic categories are shown in this table. Obviously there are other clusters having dements from the same linguistic categories in different parts of the hierarchy. This table makes a comparison of the maximum numbers of words successfully collected in order to analyze the improvement obtained. Gathering nouns and auziliaries seems to be carried out better with the Manhattan Metric. However if we consider the number of initial clusters forming these largest ones, a signilicant progress seems to exist for the Combined Metric. There is a big difference for these numbers between the two. For instance 12 present perfect verb classes are combined successfully when the Combined Metric is used, but only 8 of them were combined with the Manhattan Metric. For adjectives this is 7 to 2, for past perfect verbs 5 to 1 and although number of nouns collected by the Manhattan Metric is larger, number of initial clusters suecessfully combined by the Combined Metric is still larger.</Paragraph>
      <Paragraph position="3"> It can be claimed that there is a significant progress in the process of successfully combining the initial clusters when the new metric is used. This was the main problem encountered with the Manhat'tan Metric and the other ones. This is denoted as the progress obtained by using the Combined Metric trying to represent both of the two approaches that can be taken into account for the similarity of linguistic dements.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>