<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0104">
  <Title>Automatic Extraction of Systematic Polysemy Using Tree-cut</Title>
  <Section position="4" start_page="20" end_page="22" type="metho">
    <SectionTitle>
2 Tree Generalization using Tree-cut and MDL
</SectionTitle>
    <Paragraph position="0"> and MDL Before we present our method, we first give a brief summary of the tree-cut technique which we adopted from (Li and Abe, 1998). This technique is used to acquire generalized case frame patterns from a corpus using a thesaurus tree.</Paragraph>
    <Section position="1" start_page="20" end_page="21" type="sub_section">
      <SectionTitle>
2.1 Tree-cut Models
</SectionTitle>
      <Paragraph position="0"> A thesaurus tree is a hierarchically organized lexicon where leaf nodes encode lexical data  (i.e., words) and internal nodes represent abstract semantic classes. A tree-cut is a partition of a thesaurus tree. It is a list of internal/leaf nodes in the tree, and each node represents a set of all leaf nodes in a subtree rooted by the node. Such set is also considered as a cluster. 4 Clusters in a tree-cut exhaustively cover all leaf nodes of the tree, and they are mutually disjoint. For example, for a thesaurus tree in Figure 1, there are 5 tree-cuts: \[airplane, helicopter, ball, kite, puzzle\], \[AIRCRAFT, ball, kite, puzzle\], \[airplane, helicopter, TOY\], \[AIR-CRAFT, TOY\] and \[ARTIFACT\]. Thus, a tree-cut corresponds to one of the levels of abstraction in the tree.</Paragraph>
      <Paragraph position="1"> Using a thesaurus tree and the idea of treecut, the problem of acquiring generalized case frame patters (for a fixed verb) from a corpus is to select the best tree-cut that accounts for both observed and unobserved case frame instances. In (Li and Abe, 1998), this generalization problem is viewed as a problem of selecting the best model for a tree-cut that estimates the true probability distribution, given a sample corpus data.</Paragraph>
      <Paragraph position="2"> Formally, a tree-cut model M is a pair consisting of a tree-cut F and a probability parameter vector O of the same length,</Paragraph>
      <Paragraph position="4"> words, that is, P(C) = ~=1 P(nj). Here, compared to knowing all P(nj) (where 1 &lt; j &lt; m) individually, knowing one P(C) can only facilitate an estimate of uniform probability distribution among members as the best guess, that is, P(nj) = P(C) for all j. Therefore, in general, m when clusters C1..Cm are merged and generalized to C according to the thesaurus tree, the estimation of a probability model becomes less accurate.</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="22" type="sub_section">
      <SectionTitle>
2.2 The MDL Principle
</SectionTitle>
      <Paragraph position="0"> To select the best tree-cut model, (Li and Abe, 1998) uses the Minimal Description Length (MDL) principle (Rissanen, 1978). The MDL is a principle of data compression in Information Theory which states that, for a given dataset, the best model is the one which requires the minimum length (often measured in bits) to encode the model (the model description length) and the data (the data description length). For the problem of case frame generalization, the MDL principle fits very well in that it captures the trade-off between the simplicity of a model, which is measured by the number of clusters in a tree-cut, and the goodness of fit to the data, which is measured by the estimation accuracy of the probability distribution.</Paragraph>
      <Paragraph position="1"> The calculation of the description length for a tree-cut model is as follows. Given a thesaurus tree T and a sample S consisting of the case frame instances, the total description length L(M, S) for a tree-cut model M = (F, 0) is where Ci (1 &lt; i &lt; k) is a cluster in the treecut, P(Ci) is the probability of a cluster Ci, and ~/k=l P(Ci) = 1. For example, suppose a corpus contained 10 instances of verb-object relation for the verb &amp;quot;fly&amp;quot;, and the frequency of object noun n, denoted f(n), are as follows: f ( airpl ane ) -- 5, f ( helicopter ) = 3, f ( bal l ) = O, f(kite) -- 2, f(puzzle) = 0. Then, the set of tree-cut models for the thesaurus tree shown in Figure 1 includes (\[airplane, helicopter, TOY\], \[0.5, 0.3, 0.2\]) and (\[AIRCRAFT, TOY\], \[0.8, 0.2\]). Note that P(C) is the probability of cluster</Paragraph>
      <Paragraph position="3"> where L(F) is the model description length, L(OIF) is the parameter description length (explained shortly), and L(SIF , O) is the data description length. Note that L(F) + L(OIF ) essentially corresponds to the usual notion of the model description length.</Paragraph>
      <Paragraph position="4"> Each length in L(M, S) is calculated as follows. 5 The model description length L(F) is</Paragraph>
      <Paragraph position="6"> where G is the set of all cuts in T, and IG I denotes the size of G. This value is a constant for * SFor justification and detailed explanation of these formulas, see (Li and Abe, 1998).</Paragraph>
      <Paragraph position="7">  all models, thus it is omitted in the calculation of the total length.</Paragraph>
      <Paragraph position="8"> The parameter description length L(OIF ) indicates the complexity of the model. It is the length required to encode the probability distribution of the clusters in the tree-cut F. It is calculated as</Paragraph>
      <Paragraph position="10"> where k is the length of (r), and IS\[ is the size of S.</Paragraph>
      <Paragraph position="11"> Finally, the data description length L(SIF, O) is the length required to encode the whole sample data. It is calculated as</Paragraph>
      <Paragraph position="13"> where, for each n E C and each C E F,</Paragraph>
      <Paragraph position="15"> Note here that, in (7), the probability of C is divided evenly among all n in C. This way, words that are not observed in the sample receive a non-zero probability, and the data sparseness problem is avoided.</Paragraph>
      <Paragraph position="16"> Then, the best model is the one which requires the minimum total description length.</Paragraph>
      <Paragraph position="17"> Figure 2 shows the MDL lengths for all five tree-cut models that can be produced for the thesaurus tree in Figure 1. The best model is the one with the tree-cut \[AIRCRAFT, ball, kite, puzzle\] indicated by a thick curve in the figure.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="22" end_page="24" type="metho">
    <SectionTitle>
3 Clustering Systematic Polysemy
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="22" end_page="23" type="sub_section">
      <SectionTitle>
3.1 Generalization Technique
</SectionTitle>
      <Paragraph position="0"> Using the generalization technique in (Li and Abe, 1998) described in the previous section, we wish to extract systematic polysemy automatically from WordNet. Our assumption is that, if a semantic concept is systematically related to another concept, words that have one sense under one concept (sub)tree are likely to have another sense under the other concept (sub)tree. To give an example, Figure 3 shows parts of WordNet noun trees for ARTIFACT and MEASURE, where subtrees under CONTAINER and C0NTAINERFUL respectively contain &amp;quot;bottle&amp;quot;, &amp;quot;bucket&amp;quot; and &amp;quot;spoon&amp;quot;. Note a dashed line in the figure indicates an indirect link for more than one level.</Paragraph>
      <Paragraph position="1"> Based on this assumption, it seems systematic polysemy in the two trees can be extracted straight-forwardly by clustering each tree according to polysemy as a feature, and by matching of clusters taken from each tree. 6 To this end, the notion of tree-cut and the MDL principle seem to comprise an excellent tool.</Paragraph>
      <Paragraph position="2"> However, we must be careful in adopting Li and Abe's technique directly: since the problem which their technique was applied to is fundamentally different from ours, some procedures used in their problem may not have any interpretation in our problem. Although both problems are essentially a tree generalization problem, their problem estimates the true probability distribution from a random sample of examples (a corpus), whereas our problem does not have any additional data to estimate, since all data (a lexicon) is already known. This difference raises the following issue. In the calculation of the data description length in equation (6), each word in a cluster, observed or unobserved, is assigned an estimated probability, which is a uniform fraction of the probability of the cluster. This procedure does not have interpretation if it is applied to our problem.</Paragraph>
      <Paragraph position="3"> Instead, we use the distribution of feature frequency proportion of the clusters, and calculate the data description length by the following formula: null</Paragraph>
      <Paragraph position="5"> where F = \[C1,.., Ck\], 0 = \[P(C,),.., P(Ck)\].</Paragraph>
      <Paragraph position="6"> This corresponds to the length required to encode all words in a cluster, for all clusters in a tree-cut, assuming Huffman's algorithm (Huffman, 1952) assigned a codeword of length -log2P(Ci) to each cluster C/ (whose propor- null to our problem without modification.</Paragraph>
    </Section>
    <Section position="2" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
3.2 Clustering Method
</SectionTitle>
      <Paragraph position="0"> Our clustering method uses the the modified generalization technique described in the last section to generate tree-cuts. But before we apply the method, we must transform the data in Wordnet. This is because WordNet differs from a theaurus tree in two ways: it is a graph rather than a tree, and internal nodes as well as leaf nodes carry data, First, we eliminate multiple inheritance by separating shared subtrees. Second, we bring down every internal node to a leaf level by creating a new duplicate node and adding it as a child of the old node (thus making the old node an internal node).</Paragraph>
      <Paragraph position="1"> After trees are transformed, our method extracts systematic polysemy by the following three steps. In the first step, all leaf nodes of the two trees are marked with either 1 or 0 (1 if a node/word appears in both trees, or 0 otherwise), null In the second step, the generalization technique is applied to each tree, and two tree-cuts are obtained. To search for the best tree-cut, instead of computing the description length for M1 possible tree-cuts in a tree, a greedy dynamic programming algorithm is used. This algorithm , called Find-MDL in (Li and Abe, 1998), finds the best tree-cut for a tree by recursively finding the best tree-cuts for all of its sub-trees and merging them from bottom up. This algorithm is quite efficient, since it is basically a depth-first search with minor overhead for computing the description length.</Paragraph>
      <Paragraph position="2"> Finally in the third step, clusters from the two tree-cuts are matched up, and the pairs which have substantial overlap are selected as systematic polysemy.</Paragraph>
      <Paragraph position="3"> Figure 4 shows parts of the final tree-cuts for ARTIFACT and MEASURE obtained by our method. ~ In both trees, most of the clusters in the tree-cuts are from nodes at depth 1 (counting the root as depth 0). That is because the tree-cut technique used in our method is sensitive to the structure of the tree. More specifically, the MDL principle inherently penalizes a complex tree-cut by assigning a long parameter length. Therefore, unless the entropy of the feature distribution is large enough to make the data length overshadow the parameter length, simpler tree-cuts partitioned at abstract levels are preferred. This situation tends to happen often when the tree is bushy and the total feature frequency is low. This was precisely the case with ARTIFACT and MEASURE, where both Tin the figure, bold letters indicate words which are polysemous in the two tree.</Paragraph>
      <Paragraph position="4">  trees were quite bushy, and only 4% and 14% of the words were polysemous in the two categories respectively.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>