<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1218">
  <Title>A Clustering Algorithm for Chinese Adjectives and Nouns</Title>
  <Section position="5" start_page="125" end_page="126" type="metho">
    <SectionTitle>
3 A Bidirectional Hierarchical Clustering Algorithm
</SectionTitle>
      <Paragraph position="0"> Usually a hierarchical clustering algorithm \[7\] constructs a clustering &amp;quot;tree&amp;quot; by combining small clusters into large ones or (lividing large clusters into small ones.</Paragraph>
      <Paragraph position="1"> The bidirectional hierarchical clustering algorithm proposed by us is composed of two kinds of alternate clustering processes.</Paragraph>
      <Paragraph position="2"> The algorithm flow is described as follows: 1)Initially, regard every noun and adjective each as a cluster. Calculate the distances between clusters of the same part of speech.</Paragraph>
      <Paragraph position="3"> 2) Suppose without loss of generality that we choose to cluster nouns first.</Paragraph>
      <Paragraph position="4"> Select two noun clusters N, &amp; N s of the minimum distance and integrate them into a new one N~'.</Paragraph>
      <Paragraph position="5"> J) Calculate the collocational degree of the new cluster. Adjust the sequence numbers of the original clusters and the relational information of adjective clusters.</Paragraph>
      <Paragraph position="6">  4) Calculate the distances between the new cluster and other clusters.</Paragraph>
      <Paragraph position="7"> 5) Repeat from step 2) to 4) until the satisfaction of certain condition. For example, the number of the clusters haas decreased to certain amount. 2 6) Similarly, we can follow the same steps from 2) to 5) for constructing adjective clusters, completing one cycle, of clustering processes of nouns and adjectives.</Paragraph>
      <Paragraph position="8"> 7) Repeat from step 2) to 6) until the 2 In this paper, we set the proportion is 20%.  objective function 3 reaches the minimum value.</Paragraph>
      <Paragraph position="9"> One advantage of this algorithm is that: when two clusters of nouns have similar distribution environments, &amp;quot;they might be classified into one cluster. This information can be delivered to the clusters of adjectives that respectively collocate with them by the clustering process of nouns. Thus these clusters of adjectives have great possibility to be combined into one cluster, while the ordinary hierarchical clustering algorithm can not do it.</Paragraph>
  </Section>
  <Section position="6" start_page="126" end_page="126" type="metho">
    <SectionTitle>
4 An Objective Function Based on
MDL
</SectionTitle>
    <Paragraph position="0"> The objective function is designed to control the processes of clustering words based on the Minimum Description Length (MDL) principle. According to MDL, the best probability model for a given set of data is a model that uses the shortest code length for encoding the model itself and the given data relative to it \[4\] \[5\]. We regard the clusters as the model for the collocations of adjectives and nouns. The objective function is defined as the sum of the code length for the model (&amp;quot;model description length&amp;quot;) and that for the data (&amp;quot;data description length&amp;quot;). When the clustering result minimises the objective function, the bidirectional processes should be stopped and the result is the best probable one. The objective function based on MDL trade-offs between the simplicity of a model and its accuracy in fitting to the data, which are respectively quantified by the model description length and the data description length.</Paragraph>
    <Paragraph position="0"> The following are the formulas to calculate the objective function L:</Paragraph>
    <Paragraph position="2"> Lad is the model description length calculated as</Paragraph>
    <Paragraph position="4"> Where k A and k N respectively denote the number of clusters of adjectives and nouns. &amp;quot;+1&amp;quot; means that the algorithm needs one bit to indicate whether the collocational relationship between the two clusters exists.</Paragraph>
    <Paragraph position="5"> L,~ t is composed of the data description length of adjectives and that of nouns,</Paragraph>
    <Paragraph position="7"> And the two types of data description length are calculated as follows</Paragraph>
    <Paragraph position="9"/>
  </Section>
  <Section position="8" start_page="126" end_page="127" type="metho">
    <SectionTitle>
5 Our Experiment
</SectionTitle>
    <Paragraph position="0"> We take the words and collocations</Paragraph>
  </Section>
  <Section position="9" start_page="127" end_page="129" type="metho">
    <SectionTitle>
6 Discussions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="127" end_page="128" type="sub_section">
      <SectionTitle>
6.1 Revisional Distance
</SectionTitle>
      <Paragraph position="0"> When we combine clusters into a new cluster, their distribution environments will be combined as well. The combination of clusters and their distribution environments might very likely generate redundant collocations that are not listed in the thesaurus. With the word clustering processes going on, there might be more and more redundant collocations. They will obviously affect the accuracy of the distances between clusters. When calculating the distances, the redundant collocations must be considered. So the question is how to revise the distance equation. Notice that the collocational degree defined in the above measures the collocational relationship gathered in Ni's Thesaurus \[6\] to test our algorithm. From Ni's thesaurus, we obtain 2,569 adjectives, 4,536 nouns and 37,346 collocations between adjectives and nouns.</Paragraph>
      <Paragraph position="1"> Table 1 shows results of using 5 different revisional distance formulas discussed in the next section. Because the length of this paper is limited, we only give some examples (10 clusters for each part of speech) of clusters in section 8. We can see that the redundant ratio obviously decreases by using the revisional distance, and the result that has the lowest redundant ratio corresponds of the minimum value of the objective function. By human evaluation, most clusters contain the words that have similar meanings and distribution environments. So our algorithm proves to be effective for word clustering based on collocations.</Paragraph>
      <Paragraph position="3"> However, the redundant ratio is still very large. The main cause is that existing between a cluster and its distribution environment. Obviously under the same instances are too sparse, covering only 0.32% of all possible collocations. So another advantage of our algorithm is that we can acquire many new reasonable collocations not gathered in the thesaurus, ff we add the new collocations into initial thesaurus and execute the algorithm on new data set, the performance will have great potential to improve. It is further work that can be carried out in the future.</Paragraph>
      <Paragraph position="4"> distance, clusters having higher coUocational degree have more higher similarity between each other (because they have more actual collocations) than those having lower collocational degree. So the collocational degree can be used to revise the distance equations.</Paragraph>
      <Paragraph position="5"> There are two problems that should-be considered when we design the revisional distance equations. The first one is to convert  the collocational degrees of two clusters into one collocational degree as the revisional factor for distance equations. It is the average collocational degree, marked as deg, calculated by deg A and null deg I ,I141 + deg A,I ,114 ua&gt;,l (II) deg N, Iv, llN, l + deg N, Iv, IN I In fact it is the collocational degree of the new cluster into which if we assume combining the two original clusters.</Paragraph>
      <Paragraph position="6"> The second problem is that the revjsonal distance equations should keep coherent of monotonicity with the original distance. It means that under the same average collocational degree, the revional distance should keep the same (or opposite) monotonicity with the original distance, and under the same original distance, the revional distance should keep the same (or opposite) monotonicity with the average collocational degree.</Paragraph>
      <Paragraph position="7"> In this paper, four simple revisional distance equations are presented based on consideration of the upper two problems. They are: a)dis'=-deglndis</Paragraph>
      <Paragraph position="9"> d) dis' = -dis In deg Where dis' denotes the revional distance and dis denotes the original distance.</Paragraph>
      <Paragraph position="10"> From the comparison of the upper different results (shown in Table 1), we can draw the conclusion that using revisonal distance equations can increase the clustering accuracy remarkably.</Paragraph>
    </Section>
    <Section position="2" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
6.2 Determination of the Objective Function's Minimum Value
</SectionTitle>
      <Paragraph position="0"> The clustering algorithm terminates when the objective function is minimized.</Paragraph>
      <Paragraph position="1"> As a result it is very important to find out the function's minimum value. After analyzing the objective function, we find that it normally monotonically declines with clustering processes going on until it gets minimized. At the beginning, there are a large number of clusters with only one element in each of them. So the model description length is quite large while the data description length is quite small.</Paragraph>
      <Paragraph position="2"> Because the clustering process is hierarchical, every time when the combination occurs the number of clusters will decrease by one with the model description length's decreasing as well. At the same time the number of a certain cluster's elements will increase by one with the data description length's increment as well. However, the decrement is larger than the increment and it is getting smaller while the increment is getting larger. In this way, the objective function declines until the objective function reach its  addition, the clustering algorithm may help to find new collocations that are not in the thesaurus. This algodthm can also be extended to other collocation models, such as verb-noun collocations.</Paragraph>
      <Paragraph position="3">  minimum value. If we continue to execute the algorithm, we will see that the value of the objective function rises very fast like as is shown in Figure 1.</Paragraph>
      <Paragraph position="4">  Therefore we choose a fairly simple way to avoid the appearance of the local optimum: When there are two consecutive increases in the objective function during one clustering process, stop the process and start another one. When two consecutive clustering processes are stopped due to the same reason, we assume that we have got the minimum value and stop the whole clusterilag process. In our future work we will try to fred a better way to determine the minimum value of the objective function.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="129" end_page="129" type="metho">
    <SectionTitle>
7 Conclusion &amp; Future Work
</SectionTitle>
    <Paragraph position="0"> In this paper we have presented a bidirctional hierarchical clustering algorithm of simultaneously clustering Chinese adjectives and nouns based on their collocations. Our preliminary experiments show that it can distinguish different words by their distribution environments. In Our future work includes: 1) Because the sparsity of collocations is a main factor of affecting the word clustering accuracy, we can use the clustering results to discover new data and enrich the thesaurus.</Paragraph>
    <Paragraph position="1"> 2) As there are yet no adjustments to the hierarchical clustering results, we are considering using some iterative algorithm, such as K-means algofithm~ to optimise the clustering results.</Paragraph>
  </Section>
class="xml-element"></Paper>