<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2212">
  <Title>Hierarchical Clustering of Words</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word Bits Construction
</SectionTitle>
    <Paragraph position="0"> Our word bits coiJstruction &lt;~lgorMlm is ;~ lno(tiflotation mid mi extension of the mutual infornm.</Paragraph>
    <Paragraph position="1"> l, ion chistering Mgorithm proposed })y l}rown et ill, (1992). We will first illustrate the dilTereltce between file original rormuh~e iul(t the oues we used, lind theft introduce the word bits co,.st.ruction Mgorithni. We will use the same no(.aA;ion ;ks ill Ilrown et M. to tm.Lke the conll);trison e;~sier.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Mutual Information Clustering Algorithm
</SectionTitle>
      <Paragraph position="0"> MutuM information chlstering niethod enlploys a t)ottuni-up merging t)roce,(hire with the ~wel'i%ge lllUl.ll&amp;\] illfOrlrllttioll (AMI) or ;sit,cent. classes in the text ms an o\[)jective hmction. In the iltitial sta, ge, er~(:h word in the vocM)ul&lt;'u'ly of size V is i~ssig.,e(l to il;s own (listii,(:t class. We then inerge (,wo cla.sses if die merging of {;hem induces \[.i,, immn AMI reduction arllong all pMrs of classes, ttll(\] we rei)e~d; the nlergitlg 8(,e f) until {,tie Numl)er of the (:lasses is reduced to the pre(leliiied nunit)er C,. Time colitplexity or this basic algorithnl is O(V 5) when iinph;rueui,ed sl, rMglitforwardly, l}y storing the resu\]\[, of all the trim nierges ~(, (,lie previous inerging step, however, the tinie coniplexity call be reduced to O(V 3) ;~s shown t)elow.</Paragraph>
      <Paragraph position="1"> Suppos(; dia.t, stm'thig with V ch~sses, we have Mre;uiy made V - k nlerges, lelwing k (:lasses, (.:~.(J), (:~(2), .. , c.'~(,:)_ The AMI i~t, ~his sti~ge is given by the rollowillg e(llmtions.</Paragraph>
      <Paragraph position="3"> In equ;-~tion \[, qh's ;~re sunlrHed over l, he entire k X k (:lass bigrlun table ill which (l,lu) cell rel)reseilts qx+,(f,m), hi this irlerging step we invesi;igate a trial merge or c'~(i) mM (:~(j) tbr MI (:h~ss pairs (i,j), lind con,pure the AMI reduction L~(i,j) I~: - l~:(i,j) efre(:i~ed by this .~erge, where l~:(i,j) is tile AMI aft;or the lilertre,.</Paragraph>
      <Paragraph position="4"> Suppose that the I);dr (Cx:(i),C);(j)) was cho sen to merge, thai. is, l,~(i,j) ~ L~,(l,m) for M1  pairs (l,m). In the next merging st.el) , we have L~;'J)tl m) for all the pairs (l,m). to cMculate ~-1~ , ltere we use the superscript (i, j) to indicate that (Ck (i), Ck (j)) w as merged in the previous merging step.</Paragraph>
      <Paragraph position="5"> Now note that the difference between L (i'j)(l m) and L~(l,m) only comes fronl k-1 ~&amp;quot; the terms which are affected by mergiug the pair (C~(i),C~:(j)).</Paragraph>
      <Paragraph position="6"> Since L~.(l, rn) = I~-1~(I, m) and L&amp;quot;'J)(l m) = k,-l~ , - we have -- , -- ( l(i'J) -- lk ), m)) + Some part of the summation region of I~'j~)(l, ,n) and I~ cancels out with a part of l~i;~ ) or a part of a(t,.,). Let &amp;quot;0, i (t .0, i and i~ denote the values of l~iLJ)(l, rn),lt:(l,m),l~i'_J 1) and I~, respectively, after all the common terms among theln which Call be canceled are canceled out. Then, we have</Paragraph>
      <Paragraph position="8"> Because equation 3 is expressed as the summation of a fixed number of q's, its value can be cMculated in constant time, whereas the cMculation of e(tuation 1 requires O(V 2) time. Therefore, the total time complexity is reduced by O(V~).</Paragraph>
      <Paragraph position="9"> The summation regions of I's in equation 3 are illustrated in Figure 1. Brown et al. seem to have ignored the second term of the right hand side of equation 3 and used only the first term to calculate L~i,J~(l,m)_Lk(l,m) 1. However, since thesecoud term has as much weight as the first terln, we used equation 3 to mgke the model complete.</Paragraph>
      <Paragraph position="10"> Even with the O(V a) algorithm, the calculation is not practical for a large vocabulary of order 10 4 or higher. Brown et al. proposed the following 1A(:tually, it is the first term of equation 3 times (-l) that appeared in their paper, but we believe that it is simply due to a misprint.</Paragraph>
      <Paragraph position="12"> method, which we also adopted. We first make V singleton classes out of the V words in the vocabulary and arrange the (:lasses in descending order of frequency, then define the merging region as the first C+ l positions in the sequence of classes. At each merging step, merging of only the (:lasses in the merging region is considered, thus reducing the number of trial merges from O(V 2) to O(C'~).</Paragraph>
      <Paragraph position="13"> After each actual merge, the most frequent singleton class outside of the merging region is shifted into the region. With this algorithm, the time complexity is reduced to O(C ~ V).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Word Bits Construction Algorithm
</SectionTitle>
      <Paragraph position="0"> The simplest way to construct a tree structured representation of words is to construct a dendrogram from the record of the merging order. A simple example with a five-word vocabulary is shown in Figure 2. If we apply this method to the above O(C'2V) algorithm, however, we obtain for each class an extremely unbalanced, Mmost left branching subtree. The reason is that after classes in the merging region are grown to a certain size, it is much less expensive, in terms of AM1, to merge a singleton class with lower frequency into a higher frequency class than merging two higher frequency classes with substantiM sizes.</Paragraph>
      <Paragraph position="1"> A new approach we adopted is as follows.</Paragraph>
      <Paragraph position="2">  1. Ml-clustering: Make C classes using the mutual information clustering algorithm with the merging  the classes are merged into a singe (:lass. Make a dendrogram out of this process. This dendrograrn, 1),.ooC/, constitutes the upper part of the final tree. a. l,),~.,.-,:~,,s~.,.i,,,j: Let {C(I), C(2),..., c(c)} ))e the set of the classes obtained at, step l. l,'or each i (1 &lt; i &lt; C) do the following.</Paragraph>
      <Paragraph position="3">  (3) Replace all words in the text except those in C(i) with their &lt;:lass token, l)efine a new vocabu-lary V' = V1 U V&gt; where V1 = {all the words in (\](i)}, V 2 = {C'l,(\]2,...,C.,i_l,C'i+l,C,c} , and 65 is a token for (:(j) for I &lt; j &lt; C. Assign each element in V' to its own class and execute binm:y merging with a merging constraint such (,ha(, only those classes which only contaiu elements of Vl can be merged.</Paragraph>
      <Paragraph position="4"> (b) Repeat merging until all the eletuents in VI  are i)ut in a single (:lass.</Paragraph>
      <Paragraph position="5"> Make a dendrogrmn l).~,d~ out of the merging protess for each class. This (teudrogram coust, itutes a subtree for each (:lass with a leaf node rel)resenting each word in the class.</Paragraph>
      <Paragraph position="6"> 4. Combine tile dendrograms by substituting each leaf node of l)root with coresponding l),,Lb This algorithm produces a b,~lanced binary tree represent;ation of words in which (,hose words which are close in meaning or syntactic feature come close in posit, ion. Figure 3 shows an exampie of l).,~b for orle class out of 500 (:lasses constructed using this algorithm wit|) a vocabulary of the 70,000 most; frequently occm:ring words in the Wall Street; Journal Corpus. Finally, by tracing the path from the root node to a leaf node aud assigning a bit to each bra, uch with zero or one representing a left or right branc\]b respectively, we car, assign a bit-string (word bits) to each word in the vocabulary.</Paragraph>
      <Paragraph position="7"> ~\]n the actuM implement~ttion, we only htwe to work on the bigr~ml t*Lble instead of tim whole text.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="1161" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We used phdu texts from six years of tile WSJ C, ort)us to create word bits. The sizes of tile texts are 5 million words (MW), t0MW, 20MW, and 50M W. '|'he vocabulary is selected as the 70,000 most; fl:eqneutly occurring words in the entire co&gt; pus. We set the number C of &lt;:lasses to 500.</Paragraph>
    <Paragraph position="1"> The obtained hierarchical clusters are ewdua.ted via the error rate of the ATI{ l)ecision-Tree Part--Of-Speech Tagger which is based on SPAT'\['I,;t{ (Magerman 199,1). The tagger employs a set of 443 syntactic tags. In the training phase, a set of events are extracted from the training texts. An event is a set of feature-value pairs or question-answer pairs. A feature can be any attribute of the context in which the current word word(O) appears; it is conveniently expressed as a question.</Paragraph>
    <Paragraph position="2"> Figure 4 shows an example of an evetlt, with a current word &amp;quot;like&amp;quot;. The last \[)air in the event is a special item which shows the answer, i.e., the col rect tag of the current word. The first three lines show questions about identity of words around tile current word and tags for previous words. These questions are cMled basic que.slio~,s and always used. The second type of questions, word bits questions, are on clusters and word bits such as what is the 29th bit of the previous word's word bits?. The third type of questkms are cMled lingui.sl's questiona and these are compiled by an expert grmlmmrian.</Paragraph>
    <Paragraph position="3"> Out of the set of events, a decision tree is constructed whose leaf nodes contain conditionM probability distributi(ms of tags, conditioned by the feature values. In tile test phase the system looks up conditionM probability distributions of tags R)r eat:l, word in the test text and chooses the most probable tag sequences using beam search.</Paragraph>
    <Paragraph position="4"> We used WSJ texts and the ATI{ cor\[ms (lllack et al. 1996) for the tagging experiment. Both col pora use the ATR syntactic tag set. Since the ATR corpus is still in the process of development, the size of the texts we have at hand for this experiment is rather ndnimal considering tim large size of the tag set. Table 1 shows the sizes of texts used for the experiment;. Figure 5 shows the t;agging error rat;es plotted against various clustering  text sizes. Out of the three types of questions, basic questions and word bits questions are always used in this ext)eriment. 'lb see the effect of introducing word bits information into the tagger, we performed a separate experiment in which a randomly generated bit-string is assigned to each word 3 and basic questions and word bits questions are used. The results are plotted at zero clustering text size. For both WSJ texts and ATR corpus, the tagging error rate dropped by more than 30% when using word bits information extracted from the 5MW text, and increasing the clustering text size further decreases the error rate. At 50MW, the error rate drops by 43%. This shows the ira: provement of the quality of the hierarchical clusters with increasing size of the clustering text. In Figure 5, introduction of linguistic questions 4 is also shown to significantly reduce the error rates for the WSJ corpus. The dependency of the error rates on the clustering text size is quite sin&gt; liar to the ea.se in which no linguistic questions are used, indicating the effectiveness of combin3Since a distin&lt;:tive bit-string is assigned to each word, the tagger also uses a bit-string as an ID number for each word in the process, In this control experiment bit-strings are assigned in a random way, but no two words are assigned the same word lilts. Random word bits are expected to give no class information to the tagger except for the identity of words.</Paragraph>
    <Paragraph position="5">  the initial stage of development and are by no means comprelmnsive.</Paragraph>
    <Paragraph position="6"> ing automatically created word bits and hand-crafted linguistic questions. Figure 5 also shows that reshuming the classes several times just after step I (MLclustering) of the word bits construction process filrther improves the word bits. One round of reshuffling corresponds to moving each word in the vocabulary from its original (:lass to another class whenever the movement increases the AMI, starting from the most frequent word through the least frequent one. The figure shows the error rates with zero, two, and five rounds of reshufi\]ing 5. Overall high error rates are attributed to the very large tag set; and the small training set. Another notable point in the figure is that introducing word bits constructed from WSJ texts is as effective for tagging Aq'R text.s as it is for tagging WSJ texts even though these texts are from very different domains. To \[;hat extent, the obtained hierarchical clusters are considered to be portable across domains.</Paragraph>
  </Section>
class="xml-element"></Paper>