File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-0416_metho.xml

Size: 12,018 bytes

Last Modified: 2025-10-06 14:08:27

<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0416">
  <Title>An Efficient Clustering Algorithm for Class-based Language Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Probability model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Class-based language modeling
</SectionTitle>
      <Paragraph position="0"> Our probability model is a class-based model and it is an extension of the model proposed by Li and Abe (1998).</Paragraph>
      <Paragraph position="1"> We extend their two-dimensional class model to a multi-dimensional class model, i.e., we incorporate an arbitrary number of random variables in our model.</Paragraph>
      <Paragraph position="2"> Although our probability model and learning algorithm are general and not restricted to particular domains, we mainly intend to use them in natural language processing tasks where large amounts of lexical knowledge are required. When we incorporate lexical information into a model, we inevitably face the data-sparseness problem.</Paragraph>
      <Paragraph position="3"> The idea of 'word class' (Brown et al., 1992) gives a general solution to this problem. A word class is a group of words which performs similarly in some linguistic phenomena. Part-of-speech are well-known examples of such classes. Incorporating word classes into linguistic models yields good smoothing or, hopefully, meaningful generalization from given samples.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Model definition
</SectionTitle>
      <Paragraph position="0"> Let us introduce some notations to define our model. In our model, we have considered D2 kinds of discrete random variables CG</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
BD
BNCG
BE
BNBMBMBMBNCG
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> In this paper, we have considered a hard clustering model, i.e., C8B4DCCYBVB5 BP BC for any DC BPBE BV. Li &amp; Abe's model (1998) is an instance of this joint probability model, where D2 BPBE. Using more than 2 variables the model can represent the probability for the co-occurrence of triplets, such as BOsubject, verb, objectBQ.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Clustering criterion
</SectionTitle>
      <Paragraph position="0"> To determine the proper number of classes in each par-</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
tition CC
BD
BNBMBMBMBNCC
</SectionTitle>
    <Paragraph position="0"> , we need criteria other than the maximum likelihood criterion, because likelihood always become greater when we use smaller classes. We can see this class number decision problem as a model selection problem and apply some statistically motivated model selection criteria. As mentioned previously (following Li and Abe (1998)) we used the MDL principle as our clustering criterion.</Paragraph>
    <Paragraph position="1"> Assume that we have C6 samples of co-occurrence  The objective function in both clustering and parameter estimations in our method is the description length, D0B4C5BNCBB5, which is defined as follows:</Paragraph>
    <Paragraph position="3"> where C5 denotes the model and C4  B4CBB5, is called the data description length. The second term, D0B4C5B5, is called the model description length, and when sample size C6 is large, it can be approximated as</Paragraph>
    <Paragraph position="5"> where D6 is the number of free parameters in model C5.</Paragraph>
    <Paragraph position="6"> We used this approximated form throughout this paper.</Paragraph>
    <Paragraph position="7"> Given the number of classes, D1  Our learning algorithm tries to minimize D0B4C5BNCBB5 by adjusting the parameters in the model, selecting partition</Paragraph>
    <Paragraph position="9"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Clustering algorithm
</SectionTitle>
    <Paragraph position="0"> Our clustering algorithm is a combination of three basic operations: CLASSIFY, SPLIT and MERGE. We iteratively invoke these until a terminate condition is met.</Paragraph>
    <Paragraph position="1"> Briefly, these three work as follows. The CLASSIFY takes a partition CC in BT as input and improves the partition by moving the elements in BT from one class to another. This operation is similar to one iteration in the K-means algorithm. The MERGE takes a partition CC as input and successively chooses two classes BV  takes a class, BV, and tries to find the best division of BV into two new classes, which will decrease the description length the most.</Paragraph>
    <Paragraph position="2"> All of these three basic operations decrease the description length. Consequently, our overall algorithm also decreases the description length monotonically and stops when all three operations cause no decrease in description length. Strictly, this termination does not guarantee the resulting partitions to be even locally optimal, because SPLIT operations do not perform exhaustive searches in all possible divisions of a class. Doing such an exhaustive search is almost impossible for a class of modest size, because the time complexity of such an exhaustive search is of exponential order to the size of the class. However, by properly selecting the number of trials in SPLIT, we can expect the results to approach some local optimum.</Paragraph>
    <Paragraph position="3"> It is clear that the way the three operations are combined affects the performance of the resulting class-based model and the computation time required in learning. In this paper, we basically take a top-down, divisive strategy, but at each stage of division we do CLASSIFY operations on the set of classes at each stage. When we cannot divide any classes and CLASSIFY cannot move any elements, we invoke MERGE to merge classes that are too finely divided. This top-down strategy can drastically decrease the amount of computation time compared to the bottom-up approaches used by Brown et al. (1992) and Li and Abe (1998).</Paragraph>
    <Paragraph position="4"> The following is the precise algorithm for our main  Step 4 Return the resulting partitions with the parameters in the model In the Step 0 of the algorithm, INITIALIZE creates the initial partitions of BT  into two classes and then applies CLASSIFY</Paragraph>
    <Paragraph position="6"> one by one, while any elements can move.</Paragraph>
    <Paragraph position="7"> The following subsections explain the algorithm for the three basic operations in detail and show that they decrease D0B4C5BNCBB5 monotonically.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Iterative classification
</SectionTitle>
      <Paragraph position="0"> In this subsection, we explain a way of finding a local optimum in the possible classification of elements in BT</Paragraph>
      <Paragraph position="2"> given the numbers of classes in partitions CC</Paragraph>
      <Paragraph position="4"> Given the number of classes, optimization in terms of the description length (Eq.2) is just the same as optimizing the likelihood (Eq.3). We used a greedy algorithm which monotonically increases the likelihood while updating classification. Our method is a generalized version of the previously reported K-means/EMalgorithm-style, iterative-classification methods in Kneser and Ney (1993), Berkhin and Becher (2002) and Dhillon et al. (2002). We demonstrate that the method is applicable to more generic situations than those previously reported, where the number of random variables is arbitrary.</Paragraph>
      <Paragraph position="5"> To explain the algorithm more fully, we define 'counter functions' CUB4BMBMB5 as follows:</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
BP DCCV
</SectionTitle>
    <Paragraph position="0"> where the hatch (AZ) denotes the cardinality of a set and</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DC
CZ
</SectionTitle>
    <Paragraph position="0"> is the CZ-th variable in sample DC. We used BCD0D3CVBCBPBC, in this subsection.</Paragraph>
    <Paragraph position="1"> Our classification method is variable-wise. That is, to classify elements in each BT  Step 2.3 Update the parameters by maximum likelihood estimation according to the updated partition. Step 3 Return improved partition CC</Paragraph>
    <Paragraph position="3"> In Step 2.3, the maximum likelihood estimation of the parameters are given as follows:</Paragraph>
    <Paragraph position="5"> In the last expression, each term in the summation (7) is AL BC according to the conditions in Step 2 of the algorithm. Then, the summation (7) as a whole is always AL BC and only equals 0 if no elements are moved. We can confirm that the summation (6) is positive, through an optimization problem: maximize the following quantity  . Through this, we can conclude that the summation (6) is AL BC. Therefore, A1B4D0D3CVC4B5 AL BC holds, i.e., CLASSIFY increases log likelihood monotonically. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 SPLIT operation
</SectionTitle>
      <Paragraph position="0"> The SPLIT takes a class as input and tries to find a way to divide it into two sub-classes in such a way as to reduce description length. As mentioned earlier, to find the best division in a class requires computation time that is exponential to the size of the class. We will first use a brute-force approach here. Let us simply try C2 random divisions, rearrange them with CLASSIFY and use the best one. If the best division does not reduce the description length, we will not change the class at all. It may possible to use a more sophisticated initialization scheme, but this simple method yielded satisfactory results in our experiment.</Paragraph>
      <Paragraph position="1"> The following is the precise algorithm for SPLIT:  the reduced description length produced by this split Step 3 Find the maximum reduction in the records Step 4 If this maximum reduction BQ BC, return the corresponding two classes as output, or return BV if the maximum AK BC Clearly, this operation decreases D0B4C5BNCBB5 on success and does not change it on failure.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 MERGE operation
</SectionTitle>
      <Paragraph position="0"> The MERGE takes partition CC as input and successively chooses two classes BV  . This operation thus reduces the number of classes in CC and accordingly reduces the number of parameters in the model. Therefore, if we properly choose the 'redundant' classes in a partition, this merging reduces the description length by the greater reduction in the model description length which surpasses the loss in log-likelihood.</Paragraph>
      <Paragraph position="1"> Our MERGE is almost the same procedure as that described by Li (2002). We first compute the reduction in description length for all possible merges and record the amount of reduction in a table. We then do the merges in order of reduction, while updating the table.</Paragraph>
      <Paragraph position="2"> The following is the precise algorithm for MERGE.</Paragraph>
      <Paragraph position="3"> In the pseudo code, AE</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CXCY
</SectionTitle>
    <Paragraph position="0"> denotes the reduction in D0B4C5BNCBB5 which results in the merging of BV  and store them in the table. It is clear from the termination condition in Step 3.2 that this operation reduces D0B4C5BNCBB5 on success but does not change it on failure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>