<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1086">
  <Title>Inherited Feature-based Similarity Measure Based on Large Semantic Hierarchy and Large Text Corpus</Title>
  <Section position="4" start_page="508" end_page="508" type="metho">
    <SectionTitle>
3 Components of IFSM model
</SectionTitle>
    <Paragraph position="0"> IFSM consists of a hierarchical conceptual thesaurus, a set of distinctive features assigned to each object and weightings of the features. We can use, for example, WordNet or the EDR concept dictionary as a hierarchical conceptual thesaurus. Currently, there are no explicit methods to determine sets of distinctive features and their weightings of each object (word or concept).</Paragraph>
    <Paragraph position="1"> Here we adopt an automatic extraction of features and their weightings from a large text corpus. This is the same approach as that of the distribdted semantic models. However, in contrast to those models, here we hope to make the level of the representation of features high enough to capture semantic behaviors of objects.</Paragraph>
    <Paragraph position="2"> For example, if one relation and one object can be said to describe the features of object, we can define one feature Of &amp;quot;human&amp;quot; as &amp;quot;agent of walking&amp;quot;. If more context is allowed, we can define a feature of &amp;quot;human&amp;quot; as &amp;quot;agent of utilizing fire&amp;quot;. A wider context gives a precision to the contents of the features. However, a wider context exponentially increases the possible number of features which will exceed current limitations of computational resources. In consideration of these factors, we adopts triple relations such as &amp;quot;dog chase cat&amp;quot;, &amp;quot;cut paper with scissors&amp;quot; obtained from the cot&amp;quot;k dog chases a cat&amp;quot; &amp;quot;k hound chases a cat&amp;quot; &amp;quot;A dog chases a kitty&amp;quot; (&amp;quot;chase&amp;quot; &amp;quot;dog&amp;quot; &amp;quot;cat&amp;quot;) (*'chase&amp;quot; &amp;quot;hound&amp;quot; &amp;quot;cat&amp;quot;) (&amp;quot;chase&amp;quot; &amp;quot;dog&amp;quot; &amp;quot;kitty&amp;quot;) .... .............. o' o .............. o / ....... o ...........  pus as a resource of features, and apply class based abstraction (Resnik 95a) to triples to reduce the size of the possible feature space.</Paragraph>
    <Paragraph position="3"> As mentioned above, features extracted fi'om the corpus will be represented using synsets/concepts in IFSM. Since no large scale corpus data with semantic tags is available, the current implementation of IFSM has a word sense disambiguation problem in obtaining class probabilities. Our current basic strategy to this problem is similar to (Resnik,95a) in the sense that synsets associated with one word are assigned uniform frcquency or &amp;quot;credit&amp;quot; when that word appears in the corpus.</Paragraph>
    <Paragraph position="4"> We call this strategy the &amp;quot;brute-force&amp;quot; approach, like Resnik. On top of this strategy, we introduce filtering heuristics which sort out unreliable flata using heuristics based on the statistical properties of the data.</Paragraph>
  </Section>
  <Section position="5" start_page="508" end_page="512" type="metho">
    <SectionTitle>
4 The feature extraction process
</SectionTitle>
    <Paragraph position="0"> This section describes the feature extraction procedure. If a sentence &amp;quot;a dog chased a cat&amp;quot; appears in the corpus, features representing &amp;quot;chase cat&amp;quot; and &amp;quot;dog chase&amp;quot; may be attached to &amp;quot;dog&amp;quot; and &amp;quot;cat&amp;quot; respectively. Fig 4 shows the overall process used to obtain a set of abstracted triples which are sources of feature and weighting sets for synsets.</Paragraph>
    <Section position="1" start_page="508" end_page="510" type="sub_section">
      <SectionTitle>
4.1 Extraction of surface typed triples from the corpus
</SectionTitle>
      <Paragraph position="0"> from the corpus Typed surface triples are triples of surface words holding some fixed linguistic relations (Hereafter call this simply &amp;quot;surface triples&amp;quot;). The current implementation has one type &amp;quot;SO&amp;quot; which represents  &amp;quot;subject - verb - object&amp;quot; relation. A set of typed surface triples are extracted from a corpus with their frequencies.</Paragraph>
    </Section>
    <Section position="2" start_page="510" end_page="510" type="sub_section">
      <SectionTitle>
4.2 Expansion of surface triples to deep triples
</SectionTitle>
      <Paragraph position="0"> Surface triples are expanded to corresponding deep triples (triples of synset IDs) by expanding each surface word to its corresponding synsets.</Paragraph>
      <Paragraph position="1"> The frequency of the surface triples is divided by the number of generated deep triples and it is assigned to each deep triple. The frequency is also preserved ~ it is as an occurrence count. Surface words are also reserved for later processings.</Paragraph>
      <Paragraph position="2">  &amp;quot;v123&amp;quot; and &amp;quot;n5&amp;quot; are synset IDs corresponding to word &amp;quot;chase&amp;quot; and &amp;quot;dog&amp;quot; respectively, These deep triples are sorted and merged. The frequencies and the occurrence counts are summed up respectively. The surface words are merged into surface word lists as the following example shows.</Paragraph>
    </Section>
    <Section position="3" start_page="510" end_page="510" type="sub_section">
      <SectionTitle>
4.3 Synset abstraction method
</SectionTitle>
      <Paragraph position="0"> The purpose of the following phases is to extract featm:e sets for each synset in an abstracted form.</Paragraph>
      <Paragraph position="1"> In an abstracted form, the size of each lhature space becomes tractable.</Paragraph>
      <Paragraph position="2"> Abstraction of a syuset can be done by divid~ ing whole synsets into the appropriate number of synset groups and determining a representative of each group to which each member is abstracted.</Paragraph>
      <Paragraph position="3"> There are several methods to decide a set of synset groups using a hierarchical structure. One of the simplest methods is to make groups by cutting the hierarchy structure at some depth from the root.</Paragraph>
      <Paragraph position="4"> We call this the flat-depth grouping method. Another method tries to make the nmnber of synsets in a group constant, i.e., the upper/lower bound for a number of concepts is given as a criteria (ttearst,93). We call this the flat-size grouping method. In our implementation, we introduce a new grouping method called the flat-probability grouping method in which synset groups are specified such that every group has the same class probabilities. One of the advantages of this method is that it is expected to give a grouping based on the quantity of information which will be suitable for the target task, i.e., semantic abstraction of triples. The degree of abstraction, i.e., the number of groups, is one of the principal factors in deciding the size of the feature space and the preciseness of the features (power of description).</Paragraph>
    </Section>
    <Section position="4" start_page="510" end_page="510" type="sub_section">
      <SectionTitle>
4.4 Deep triple abstraction
</SectionTitle>
      <Paragraph position="0"> Each synset of deep triples is abstracted based on the flat-probability grouping method. These abstracted triples are sorted and merged. Original synset IDs are maintained in this processing for feature extraction process. The result is called the abstracted deep triple set.</Paragraph>
      <Paragraph position="1">  Synset &amp;quot;v28&amp;quot; is an abstraction of synset &amp;quot;v123&amp;quot; and synset &amp;quot;v224&amp;quot; which corresponds to &amp;quot;chase&amp;quot; and &amp;quot;run_after&amp;quot; respectively. Synset &amp;quot;ng&amp;quot; con:esponding to &amp;quot;cat&amp;quot; is an abstraction of synset &amp;quot;nS&amp;quot; corresponding to &amp;quot;kitty&amp;quot;.</Paragraph>
    </Section>
    <Section position="5" start_page="510" end_page="510" type="sub_section">
      <SectionTitle>
4.5 Filtering abstracted triples by heuristics
</SectionTitle>
      <Paragraph position="0"> heuristics Since the current implementation adepts the &amp;quot;brute-force&amp;quot; approach, almost all massively generated deep triples are fake triples. The filtering process reduces the number of abstracted triples using heuristics based on statistical data attached to the abstracted triples. There are three types of statistical data available; i.e., estimated frequency, estimated occurrences of abstracted triples and lists of surface words.</Paragraph>
      <Paragraph position="1"> \[ler% the length of a surface word list associated with an abstracted synset is called a surface support of the abstracted synset. A heuristics rule using some fixed frequency threshold and a surface support bound are adopted in the current implementation. null</Paragraph>
    </Section>
    <Section position="6" start_page="510" end_page="511" type="sub_section">
      <SectionTitle>
4.6 Common feature extraction from abstracted triple set
</SectionTitle>
      <Paragraph position="0"> abstracted triple set This section describes a method for obtaining features of each synset. Basically a feature is typed binary relation extracted from an abstracted triple. From the example triple, (SO v28 115 n9 (v12a v224) (,,5) (,,9 ns) ,~,a a~ (&amp;quot; chase&amp;quot; &amp;quot;run &amp;quot;'after&amp;quot;)(&amp;quot; dog&amp;quot; &amp;quot;hound&amp;quot;) (&amp;quot; cat&amp;quot; &amp;quot;kitty&amp;quot;)) the following features are extracted for three of the synsets contained in the above data.</Paragraph>
      <Paragraph position="2"> An abstracted triple represents a set of exmnples in the text corpus and each sentence in the corpus usually describes some specific event.</Paragraph>
      <Paragraph position="3"> This means that the content of each abstracted  triple cannot be treated as generally or universally true. For example, even if a sentence &amp;quot;a man bit a dog&amp;quot; exists in the corpus, we cannot declare that &amp;quot;biting dogs&amp;quot; is a general property of &amp;quot;man&amp;quot;. Metaphorical expressions are typical examples. Of course, the distributional semantics approach assumes that such kind of errors or noise are hidden by the accumulation of a large number of examples.</Paragraph>
      <Paragraph position="4"> However, we think it might be a more serious problem because many uses of nouns seem to have an anaphoric aspect, i.e., the synset which best fits the real world object is not included in the set of synsets of the noun which is used to refer to the real world object. &amp;quot;The man&amp;quot; can be used to express any descendant of the concept &amp;quot;man&amp;quot;. We call this problem the word-referent disambiguation problem. Our approach to this problem will be described elsewhc're.</Paragraph>
      <Paragraph position="5"> Preliminary experiments on feature extraction using 1010 corpus In this section, our preliminary experiments of the feature extraction process are described. In these experiments, we examine the proper granularity of abstracted concepts. We also discuss a criteria for evaluating filtering heuristics. Word-Net 1.4, 1010 corpus and Brown corpus are utilized through the exI)eriments. The 1010 corpus is a multiqayered structured corpus constructed on top of the FRAMEIX-D knowledge representation language. More than 10 million words of news articles have been parsed using a multi-scale parser and stored in the corpus with mutual references to news article sources, parsed sentence structures, words and WordNet synsets.</Paragraph>
    </Section>
    <Section position="7" start_page="511" end_page="511" type="sub_section">
      <SectionTitle>
5.1 Experiment on flat-probability grouping
</SectionTitle>
      <Paragraph position="0"> grouping To examine the appropriate number of abstracted synsets, we calculated three levels of abstracted synset sets using the fiat probability grouping method. Class probabilities for noun and verb synsets are calculated using the brute force method based on 280K nouns and 167K verbs extracted fl'om the Brown eortms (1 million words). We selected 500, 1500, 3000 synset groups for candidates of feature description level. The 500 node level is considered to be a lowest boundary and the 3000 node level is expected to be the tar- null get abstraction level. This expectation is based on the observation that 3000 node granularity is empirically sulficient for deseribing the translation patterns for selecting the proper target Fmglish verb for one Japanese verb(lkehara,93).</Paragraph>
      <Paragraph position="1"> Table 1 shows the average synset node depth and the distribution of synset node depth of Word-Net1.4. Table 2 lists the top five noun synsets in the fiat probability groupings of 500 and 3000 synsets. &amp;quot;{}&amp;quot; shows synset. The first and the second number in &amp;quot;0&amp;quot; shows the class frequency and the depth of synset respectively.</Paragraph>
      <Paragraph position="2"> Level 500 grout)ings contain a very abs|racted level of synsets such as &amp;quot;action&amp;quot;, &amp;quot;time_period&amp;quot; and &amp;quot;natural_object&amp;quot;. This level seems to be too general for describing the features of objects.</Paragraph>
      <Paragraph position="3"> In contrast, the level 3000 groupings contains &amp;quot;natural_language&amp;quot;, &amp;quot;weapotf', &amp;quot;head,chief', and &amp;quot;point_in_time&amp;quot; which seems to be a reasonable basis for feature description.</Paragraph>
      <Paragraph position="4"> There is a relatively big depth gap between synsets in the abstracted synset group. F, ven in the 500 level synset group, there is a two-depth gap. In the 3000 level synset group, there is 4 depth gap between &amp;quot;capitalist&amp;quot; (depth 4:) and &amp;quot;point_in_time&amp;quot; (depth 8). The interesting point here is that &amp;quot;point_in_time&amp;quot; seems to be more at). stract than &amp;quot;capitalist, &amp;quot; inluitively speaking. The actual synset numbers of each level of synset groups are 518, 15%8, and 3001. 'fhus the fiat probability grouping method can precisely control the lew'J of abstraction. Considering the possible abstraction levels available by the fiatdepth method, i.e., depth 2 (122 synsets), depth 3 (966 synsets), depth 4 (2949 synsets), this is a great advantage over the flat probability grouping.</Paragraph>
    </Section>
    <Section position="8" start_page="511" end_page="512" type="sub_section">
      <SectionTitle>
5.2 Experiment: Abstracted triples from the 1010 corpus
</SectionTitle>
      <Paragraph position="0"> 1010 corpus A preliminary experiment for obtaining abstract triples as a basis of features of synsets was conducted. 82,703 surface svo triples are extracted from the 101.0 corpus. Polarities of abstracted triple sets for 500, 1500, 3000 level abstraction are 1.20M, 2.03M and 2.30M respectively. Each  abstract triple holds ft:equeu(:y, oc&lt;:llJ'lX)llO(, lltl/ll-. be G and woful list, which mqq~(a't.s each of thce(~ al)st ra(:ted sy nsel:s.</Paragraph>
      <Paragraph position="1"> A lilt ering heuristic that elin~htates al:+sll'a&lt;:t trit,les whose stlr\['ac(; Sul)pOrl is three (i.e., supported })y only one sm'face \])~I~LC\]:II) iS al&gt;plicd to each set of al)sLracLed Iril)les , ;111(1 l;(.','-.;tl\[I.s iu the R)llowing sizes of at)stract:ed triple sets in the 379K (level 500), 150b: (level 1500) and 561,: (\]ewq 3000) respectively. F, ach triple is assiglted a evaluation score which is a snt|, of m)rnmlized surface SUl)l)(~rL score (:: Sll.l:f3e('. Sllt)l)orl; s&lt;:ore/tl-taXilHtllll ,qlll'I'~/ce SUl)l)orL score) ;+tim normalized \[\]:e(luet~(;y (~ fre+ (luency / nmxi/unm fi'equency). 'l'at)le 3 shows the top \[be abstra&lt;'ted tril)les with respect to dw+ir ewduaLiot~ scores, ltetns in the talJe shows subject syllseL, vert)SyllSel;, oh j0el; synseL, sttrfa&lt;:e supl)Orl;, fr&lt;'qtleltcy ~tll(\] oc('.ll\])re\]lC(~ IlllIlI\])0I!S. All the sul&lt;iccl;s in the top five al)sLract triples of level 500 are &amp;quot;organizal;iotf'. This seems to be. r0asonal)le bee;rose the COlll;eltl;s of the 10\] 0 corpus are news articles ~tt(t l:hese triples seem to show some highly abstract, briefing of the cont.ent, s of the corpus.</Paragraph>
      <Paragraph position="2"> The clfcclAveness of the filtering ;re(l/or scoring hcuri6dcs ca, n bc tl:l(~a,,stlr(;(l ttsilt~ tvC/(~ ch)scly re-. lated criteria. One measm:es the l)lausitfility o\[' al)stract.ed triple,s i.e., the r(x'all and l)cecision |'a-+ I;io of the l)\]ausible at)straeted Lriples. 'l'he other criteria shows the correctness of the mappings of die surface t;riple I)atLerns to abstracted tril)les.</Paragraph>
      <Paragraph position="3"> varsity \[htiled_Nations t:ealn subsidiary State state staff so+ vier school l'olitburo police patrol party palml Organization OI'(\[CI&amp;quot; operation lle!'vVSl)~/,})Vl' lilissioll Ministry lll(:II21)t'F lll\[tg\[~zine lin(: law_firm law hind 3u:~tice_l)epartmcnt jury industry hOL|S(? h&lt;~ztd(tual't,.'I'S govut'illI/tHlt g?tllg I:tlA division COlllq ,:'OllHtry co/lllcil Collf(!l'eltc(! &lt;:(Jllll)tllly ('ommitlct (:ollcge (:ht\]2 Cabinet business board associ~tion Association airline Table 4. SIIrflk(?( ~. Snplmrt, s of &amp;quot;orgmdzation&amp;quot; This is measured I)y counting the eon:ect surface supports of each absl.racted triple, l&amp;quot;or example, considering a set of sm:l';u:e words sut~port.ing &amp;quot;o&gt; ganization&amp;quot; of Lhe I of level ,500 shown in table 4, the word &amp;quot;panel&amp;quot; rnight loe used as &amp;quot;panel board&amp;quot;. 'l'his abilily is also measm:ed by developing the word sense dismnbiguator whic.h inputs the surfa(:e tril)le and select:s lhe most l~\[ausil)le deep Iril)le based ou abstracted triple scores matched with the deep triple, 'Flm surface SUlh~octs iu 'l';t-hie 4 show the intuitNe tendency that a suftlcient number of triple data will generate solid results.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML