<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2181">
  <Title>Building Accurate Semantic Taxonomies from Monolingual MRDs</Title>
  <Section position="3" start_page="0" end_page="1104" type="metho">
    <SectionTitle>
2 Acquiring taxonomies from MRDs
</SectionTitle>
    <Paragraph position="0"> A straightforward way to obtain an LKB is to acquire taxonomic relations from dictionary definitions following a purely bottom-up strategy with three steps: 1) parsing each definition to obtain the genus, 2) performing a genus disambiguation procedure, and 3) building a natural classification of the concepts as a concept taxonomy with several tops.</Paragraph>
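Step 3 above can be sketched as follows. This is an illustrative sketch (not code from the paper), assuming steps 1 and 2 have already produced disambiguated (sense, genus sense) pairs; the sense identifiers are toy examples, not real DGILE senses:

```python
# Step 3 of the bottom-up strategy: build a concept taxonomy, with
# possibly several tops, from disambiguated (sense, genus_sense) pairs.
# The sense identifiers below are hypothetical.

def build_taxonomy(pairs):
    """Return (children, tops): hyponym lists per sense, plus the tops
    (senses that never appear as a hyponym of anything)."""
    children = {}
    has_parent = set()
    senses = set()
    for sense, genus in pairs:
        senses.update((sense, genus))
        children.setdefault(genus, []).append(sense)
        has_parent.add(sense)
    # A dictionary-derived taxonomy typically has several tops, one per
    # definitional "unique beginner".
    tops = sorted(s for s in senses if s not in has_parent)
    return children, tops

pairs = [
    ("vino_1_1", "zumo_1_1"),     # wine  <- juice
    ("zumo_1_1", "bebida_1_3"),   # juice <- drink
    ("pan_1_1", "alimento_1_1"),  # bread <- food
]
children, tops = build_taxonomy(pairs)
print(tops)  # two tops: ['alimento_1_1', 'bebida_1_3']
```

Because a dictionary offers no single root, the tops returned here are exactly the senses the mixed methodology of this section must later attach to prescribed semantic primitives.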
    <Paragraph position="1"> Following this purely descriptive methodology, the semantic primitives of the LKB could be obtained by collecting those dictionary senses appearing at the top of the complete taxonomies derived from the dictionary. By characterizing each of these tops, the complete LKB could be produced. For DGILE, the complete noun taxonomy was derived following the automatic method described by (Rigau et al. 97) 1.</Paragraph>
    <Paragraph position="2">  However, several problems arise due to a) the source (i.e., circularity, errors, inconsistencies, omitted genus, etc.) and b) the limitations of the genus sense disambiguation techniques applied: e.g., (Bruce et al. 92) report 80% accuracy using automatic techniques, while (Rigau et al. 97) report 83%. Furthermore, the top dictionary senses do not usually represent the semantic subsets that the LKB needs to characterize in order to represent useful knowledge for NLP systems. In other words, there is a mismatch between the knowledge directly derived from an MRD and the knowledge needed by an LKB.</Paragraph>
    <Paragraph position="3"> To illustrate the problem we are facing, let us suppose we plan to place the FOOD concepts in the LKB. Neither collecting the taxonomies derived from a top dictionary sense (or from a subset of the top dictionary senses of DGILE) closest to FOOD concepts (e.g., substancia -substance-), nor collecting those subtaxonomies starting from closely related senses (e.g., bebida -drinkable liquids- and alimento -food-), allows us to collect exactly the FOOD concepts present in the MRD. The former are too general (they would cover non-FOOD concepts) and the latter are too specific (they would not cover all FOOD dictionary senses, because FOODs are described in many ways).</Paragraph>
    <Paragraph position="4"> All these problems can be solved using a mixed methodology: attaching selected top concepts (and their derived taxonomies) to prescribed semantic primitives represented in the LKB. Thus, first, we prescribe a minimal ontology (represented by the semantic primitives of the LKB) capable of representing the whole lexicon derived from the MRD, and second, following a descriptive approach, we collect, for every semantic primitive placed in the LKB, its subtaxonomies. Finally, those subtaxonomies selected for a semantic primitive are attached to the corresponding LKB semantic category.</Paragraph>
    <Paragraph position="5"> Several prescribed sets of semantic primitives have been created as Ontological Knowledge Bases: e.g. Penman Upper Model (Bateman 90), CYC (Lenat &amp; Guha 90), WordNet (Miller 90).</Paragraph>
    <Paragraph position="6"> Depending on the application and theoretical tendency of the LKB, different sets of semantic primitives can be of interest. For instance, the WordNet noun top unique beginners are 24 semantic categories, (Yarowsky 92) uses the 1,042 major categories of Roget's thesaurus, and (Liddy &amp; Paik 92) use the 124 major subject areas of LDOCE. [Footnote 1: the complete noun taxonomy derived from DGILE has a set of tops (which have no hypernyms) and 89,458 leaves (which have no hyponyms); that is, 21,334 definitions are placed between the top nodes and the leaves.]</Paragraph>
    <Paragraph position="7">  (Hearst &amp; Schütze 95) convert the hierarchical structure of WordNet into a flat system of 726 semantic categories.</Paragraph>
    <Paragraph position="8"> In the work presented in this paper we used as semantic primitives the 24 lexicographer's files (or semantic files) into which the 60,557 noun synsets (87,641 nouns) of WordNet 1.5 (WN1.5) are classified 2. Thus, we considered the 24 semantic tags of WordNet as the main LKB semantic primitives to which all dictionary senses must be attached. In order to overcome the language gap we also used a bilingual Spanish/English dictionary.</Paragraph>
  </Section>
  <Section position="4" start_page="1104" end_page="1106" type="metho">
    <SectionTitle>
3 Attaching DGILE dictionary senses to semantic primitives
</SectionTitle>
    <Paragraph position="0"> In order to classify all nominal DGILE senses with respect to WordNet semantic files, we used an approach similar to that suggested by (Yarowsky 92). Rather than collecting evidence from a blurred corpus (words belonging to a Roget's category are used as seeds to collect a subcorpus for that category; that is, a window context produced by a seed can be placed in several subcorpora), we collected evidence from dictionary senses labelled by a conceptual distance method (that is, a definition is placed in one semantic file only).</Paragraph>
    <Paragraph position="1"> This task is divided into three fully automatic consecutive subtasks. First, we tag a subset (due to the difference in size between the monolingual and the bilingual dictionaries) of DGILE dictionary senses by means of a process that uses the conceptual distance formula; second, we collect salient words for each semantic file; and third, we enrich each DGILE dictionary sense with a semantic tag collecting evidence from the salient words previously computed.</Paragraph>
    <Section position="1" start_page="1104" end_page="1105" type="sub_section">
      <SectionTitle>
3.1 Attach WordNet synsets to DGILE headwords
</SectionTitle>
      <Paragraph position="1"> For each DGILE definition, the conceptual distance between headword and genus has been computed using WN1.5 as a semantic net. We obtained results only for those definitions having English translations for both headword and genus. By computing the conceptual distance between two words (w1,w2) we are also selecting those concepts (c1i,c2j) which represent them and seem to be closer with respect to the semantic net used. Conceptual distance is computed using formula (1). [Footnote 2: One could use other semantic classifications, because this methodology needs only a minimal set of informed seeds. These seeds can be collected from MRDs, thesauri, or even by introspection; see (Yarowsky 95).]</Paragraph>
      <Paragraph position="2"> (1) dist(w1, w2) = min over c1i ∈ w1, c2j ∈ w2 of Σ_{ck ∈ path(c1i, c2j)} 1/depth(ck). That is, the conceptual distance between two concepts depends on the length of the shortest path that connects them and the specificity of the concepts in the path.</Paragraph>
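Formula (1) can be sketched as follows, assuming a toy hypernym tree in place of WN1.5 (the concept names and tree structure are hypothetical); depth is counted from the root, so the root contributes 1/1 to a path sum:

```python
import itertools

# Toy hypernym tree standing in for the WN1.5 semantic net (hypothetical).
parent = {
    "entity": None,
    "food": "entity", "artifact": "entity",
    "beverage": "food", "juice": "beverage", "wine": "juice",
    "bread": "food",
}

def depth(c):
    """Depth of concept c, counting the root as depth 1."""
    d = 0
    while c is not None:
        d, c = d + 1, parent[c]
    return d

def path_nodes(c1, c2):
    """Concepts on the path from c1 to c2 through their lowest common
    ancestor (the shortest path in a tree)."""
    up = []
    c = c1
    while c is not None:
        up.append(c)
        c = parent[c]
    seen = set(up)
    c, down = c2, []
    while c not in seen:
        down.append(c)
        c = parent[c]
    return up[:up.index(c) + 1] + list(reversed(down))

def dist(syns1, syns2):
    """Formula (1): minimise the sum of 1/depth(ck) over the connecting
    path, over all candidate concept pairs for the two words."""
    return min(
        sum(1.0 / depth(ck) for ck in path_nodes(c1, c2))
        for c1, c2 in itertools.product(syns1, syns2)
    )
```

Because path concepts are weighted by 1/depth, a path through a very general concept near the root is penalised more than an equally long path through specific concepts, which is the specificity effect the paper describes.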
      <Paragraph position="3"> As the bilingual dictionary is not disambiguated with respect to WordNet synsets (every Spanish word has been assigned all possible connections to WordNet synsets), the degree of polysemy has increased from 1.22 (WN1.5) to 5.02, and obviously many of these connections are not correct. This is one of the reasons why, after processing the whole dictionary, we obtained an accuracy of only 61% at a sense (synset) level (that is, correct synsets attached to Spanish headwords and genus terms) and 64% at a file level (that is, correct WN1.5 lexicographer's files assigned to DGILE dictionary senses). We processed 32,208 dictionary definitions, obtaining 29,205 with a synset assigned to the genus (for the rest we did not obtain a bilingual-WordNet relation between the headword and the genus, see Table 1).</Paragraph>
      <Paragraph position="4"> In this way, we obtained a preliminary version of 29,205 dictionary definitions semantically labelled (that is, with WordNet lexicographer's files) with an accuracy of 64%, classified in 24 partitions (each one corresponding to a semantic category). Table 2 compares the distribution of these DGILE dictionary senses (see column a) with respect to the WordNet semantic categories. The greatest differences appear with the classes ANIMAL and PLANT, which correspond to large taxonomic scientific classifications occurring in WN1.5 but which do not usually appear in a bilingual dictionary.</Paragraph>
    </Section>
    <Section position="2" start_page="1105" end_page="1105" type="sub_section">
      <SectionTitle>
3.2 Collect the salient words for every semantic primitive
</SectionTitle>
      <Paragraph position="1"> Once we have obtained the first DGILE version with semantically labelled definitions, we can collect the salient words (that is, those representative words for a particular category) using a Mutual Information-like formula (2), where w means word and SC semantic class.</Paragraph>
      <Paragraph position="2"> (2) AR(w, SC) = Pr(w|SC) log2 ( Pr(w|SC) / Pr(w) ). Intuitively, a salient word appears significantly more often in the context of a semantic category than at other points in the whole corpus, and hence is a better than average indicator for that semantic category. The words selected are those most relevant to the semantic category, where relevance is defined as the product of salience and local frequency. That is to say, important words should be distinctive and frequent.</Paragraph>
      <Paragraph position="3"> We performed the training process considering only the content word forms from dictionary definitions and we discarded those salient words with a negative score. Thus, we derived a lexicon of 23,418 salient words (one word can be a salient word for many semantic categories, see Table 2, columns b and c).</Paragraph>
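The training step above can be sketched over a toy labelled dictionary (the Spanish words and class labels below are illustrative, not the trained DGILE data): formula (2) gives the salience, relevance multiplies it by local frequency, and words with a negative association ratio are discarded:

```python
import math
from collections import Counter

# Toy labelled definitions: (semantic class, content words). Hypothetical.
labelled = [
    ("FOOD", ["bebida", "alcoholica", "obtenida", "uva"]),
    ("FOOD", ["zumo", "de", "fruta", "fermentado"]),
    ("ARTIFACT", ["recipiente", "de", "vidrio", "para", "bebida"]),
]

def salient_words(labelled, sc):
    """Relevance per word for class sc: AR(w, SC) * local frequency,
    keeping only words with a positive association ratio."""
    n_all = sum(len(ws) for _, ws in labelled)
    f_all = Counter(w for _, ws in labelled for w in ws)
    in_sc = [ws for tag, ws in labelled if tag == sc]
    n_sc = sum(len(ws) for ws in in_sc)
    f_sc = Counter(w for ws in in_sc for w in ws)
    relevance = {}
    for w, f in f_sc.items():
        p_w_sc = f / n_sc                          # Pr(w|SC)
        p_w = f_all[w] / n_all                     # Pr(w)
        ar = p_w_sc * math.log2(p_w_sc / p_w)      # formula (2)
        if ar > 0:                                 # discard negative scores
            relevance[w] = ar * f                  # salience * local frequency
    return relevance

sal = salient_words(labelled, "FOOD")
```

Words that are frequent everywhere (here the preposition de, or bebida, which also heads an ARTIFACT definition) get a non-positive ratio and are dropped, while class-specific words like zumo survive.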
    </Section>
    <Section position="3" start_page="1105" end_page="1106" type="sub_section">
      <SectionTitle>
3.3 Enrich DGILE definitions with WordNet semantic primitives
</SectionTitle>
      <Paragraph position="1"> Using the salient words per category (or semantic class) gathered in the previous step we labelled the DGILE dictionary definitions again. When any of the salient words appears in a definition, there is evidence that the word belongs to the category indicated. If several of these words appear, the evidence grows.</Paragraph>
      <Paragraph position="3"> (Table 3: salient words (per context) with respect to WN1.5 semantic tags.)</Paragraph>
      <Paragraph position="10"> We add together their weights, over all words in the definition, and determine the category for which the sum is greatest, using formula (3).</Paragraph>
      <Paragraph position="12"> (3) SC(d) = argmax_SC Σ_{w ∈ d} AR(w, SC). Thus, we obtained a second semantically labelled version of DGILE (see Table 2, column d). This version has 86,759 labelled definitions (covering more than 93% of all noun definitions) with an accuracy rate of 80% (we have gained, since the previous labelled version, 62% coverage and 16% accuracy).</Paragraph>
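This labelling step can be sketched with a hypothetical salient-word lexicon (the weights below are invented, not trained values): for each definition the weights of the salient words it contains are summed per class, and the class with the greatest sum wins:

```python
# Hypothetical relevance weights per semantic class (illustrative only).
salient = {
    "FOOD": {"bebida": 1.2, "zumo": 0.9, "comer": 0.7},
    "ARTIFACT": {"recipiente": 1.1, "vidrio": 0.8},
}

def tag_definition(words, salient):
    """Sum salient-word weights per class over the definition and
    return the argmax class, or None if no salient word occurred."""
    scores = {
        sc: sum(weights.get(w, 0.0) for w in words)
        for sc, weights in salient.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(tag_definition(["bebida", "de", "zumo", "fermentado"], salient))
# -> FOOD (score 2.1 vs 0.0 for ARTIFACT)
```

Each additional salient word in the definition adds to the evidence for its class, which is how coverage grows beyond the definitions whose genus alone could be disambiguated.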
      <Paragraph position="13"> The main differences appear (apart from the classes ANIMAL and PLANT) in the classes ACT and PROCESS. This is because during the first automatic labelling many dictionary definitions with genus acci6n (act or action) or efecto (effect) were classified erroneously as ACT or PROCESS.</Paragraph>
      <Paragraph position="14"> These results are difficult to compare with those of (Yarowsky 92). We are using a smaller context window (the noun dictionary definitions have 9.68 words on average) and a microcorpus (181,669 words). By training salient words on a labelled dictionary (only 64% correct) rather than a raw corpus, we expected to obtain less noise.</Paragraph>
      <Paragraph position="15"> Although we used the 24 lexicographer's files of WordNet as semantic primitives, a more fine-grained classification could be made. For example, all FOOD synsets are classified under the &lt;food, nutrient&gt; synset in file 13. However, FOOD concepts are themselves classified into 11 subclasses (i.e., &lt;yolk&gt;, &lt;gastronomy&gt;, &lt;comestible, edible, eatable, ...&gt;, etc.). Thus, if the LKB we are planning to build needs to represent &lt;beverage, drink, potable&gt; separately from &lt;comestible, edible, eatable, ...&gt;, a finer set of semantic primitives should be chosen: for instance, considering each direct hyponym of a synset belonging to a semantic file as a new semantic primitive, or even selecting for each semantic file the level of abstraction we need.</Paragraph>
      <Paragraph position="16"> A further experiment could be to iterate the process by collecting from the second labelled dictionary (a bigger corpus) a new set of salient words and re-estimating the semantic tags for all dictionary senses (a similar approach is used in Riloff &amp; Shepherd 97).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1106" end_page="1106" type="metho">
    <SectionTitle>
4 Selecting the main top beginners for a semantic primitive
</SectionTitle>
    <Paragraph position="0"> This section is devoted to locating the main top dictionary sense taxonomies for a given semantic primitive, in order to attach all these taxonomies to the correct semantic primitive in the LKB.</Paragraph>
    <Paragraph position="1"> In order to illustrate this process, we will locate the main top beginners for the FOOD dictionary senses. However, we must consider that many of these top beginners are structured: that is, some of them belong to taxonomies derived from other ones, and thus cannot be directly placed within the FOOD type. This is the case of vino (wine), which is a zumo (juice): both are top beginners for FOOD and one is a hyponym of the other.</Paragraph>
    <Paragraph position="2"> First, we collect all genus terms from the whole set of DGILE dictionary senses labelled in the previous section with the FOOD tag (2,614 senses), producing a lexicon of 958 different genus terms (only 309, 32%, appear more than once in the FOOD subset of dictionary senses).</Paragraph>
    <Paragraph position="3"> As the automatic dictionary sense labelling is not free of errors (around 80% accuracy), we can discard some senses by using filtering criteria.</Paragraph>
    <Paragraph position="4"> * Filter 1 (F1) removes all FOOD genus terms not assigned to the FOOD semantic file during the mapping process between the bilingual dictionary and WordNet.</Paragraph>
    <Paragraph position="5"> * Filter 2 (F2) selects only those genus terms which appear more often as genus terms in the FOOD category than in any other. That is, genus terms which appear more frequently in dictionary definitions belonging to other semantic tags are discarded.</Paragraph>
    <Paragraph position="6"> * Filter 3 (F3) discards those genus terms which appear with a low frequency as genus terms in the FOOD semantic category. That is, infrequent genus terms (given a certain threshold) are removed. Thus, F3&gt;1 means that the filtering criteria have discarded those genus terms  appearing in the FOOD subset of dictionary definitions less than twice.</Paragraph>
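The three filters can be sketched over hypothetical genus-frequency counts (the counts and the food_mappable set below are invented for illustration, not the paper's data):

```python
from collections import Counter

# genus_by_tag[tag][g]: how often g heads a definition labelled tag.
# Hypothetical counts for illustration.
genus_by_tag = {
    "FOOD":   Counter({"bebida": 40, "vino": 12, "pez": 3, "salsa": 1}),
    "ANIMAL": Counter({"pez": 25}),
}
# F1: genus terms whose bilingual/WordNet mapping allows the FOOD file.
food_mappable = {"bebida", "vino", "salsa"}

def select_genus_terms(tag, use_f1=False, use_f2=True, f3_threshold=1):
    selected = []
    for g, freq in genus_by_tag[tag].items():
        if use_f1 and g not in food_mappable:
            continue  # F1: never mapped to the target semantic file
        if use_f2 and any(other[g] > freq
                          for t, other in genus_by_tag.items() if t != tag):
            continue  # F2: more frequent under some other semantic tag
        if freq <= f3_threshold:
            continue  # F3: infrequent genus term (threshold F3 > n)
        selected.append(g)
    return sorted(selected)

print(select_genus_terms("FOOD"))  # pez dropped by F2, salsa by F3>1
```

With these toy counts, pez is removed by F2 (it heads more ANIMAL than FOOD definitions) and salsa by F3>1 (it appears only once), mirroring how the filters trade coverage for accuracy.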
    <Paragraph position="7"> Table 4 shows the frequency of the first 10 top beginners for FOOD. Bold face is used for those genus terms removed by filter 2; thus, pez -fish- is an example of a genus term discarded by this filter. Table 5 shows the performance of the second labelling with respect to filter 3 (genus frequency), varying the threshold. From left to right: filter, number of genus terms selected (#GT), accuracy (A), number of definitions (#D), and their respective accuracy.</Paragraph>
    <Paragraph position="8">  Tables 6 and 7 show that, at the same level of genus frequency, filter 2 (removing genus terms which are more frequent in other semantic categories) is more accurate than filter 1 (removing all genus terms whose translation cannot be FOOD). For instance, no error appears when selecting those genus terms which appear 10 or more times (F3) and are more frequent in that category than in any other (F2).</Paragraph>
    <Paragraph position="9"> Table 8 shows the coverage of correct genus terms selected by criteria F1 and F2 with respect to criterion F3. Thus, for genus terms appearing 10 or more times, by using either of the two criteria we collect 97% of the correct ones. That is, in both cases the criterion discards less than 3% of correct genus terms.</Paragraph>
  </Section>
  <Section position="6" start_page="1106" end_page="1108" type="metho">
    <SectionTitle>
5 Building automatically large scale taxonomies from DGILE
</SectionTitle>
    <Paragraph position="0"> The automatic Genus Sense Disambiguation task in DGILE has been performed following (Rigau et al. 97). This method reports 83% accuracy when selecting the correct hypernym, combining eight different heuristics using several methods and types of knowledge. Using this combined technique, the selection of the correct hypernym from DGILE performed better than the results reported by (Bruce et al. 92) using LDOCE.</Paragraph>
    <Paragraph position="1"> Once the main top beginners (relevant genus terms) of a semantic category are selected and every dictionary definition has been disambiguated, we collect all those pairs labelled with the semantic category we are working on and having one of the selected genus terms. Using these pairs, we finally build up the complete taxonomy for a given semantic primitive. That is, in order to build the complete taxonomy for a semantic primitive, we fit the lower senses using the second labelled lexicon and the genus terms selected from this labelled lexicon.</Paragraph>
    <Paragraph position="2"> Table 9 summarizes the sizes of the FOOD taxonomies acquired from DGILE with respect to the filtering criteria, compared with the results obtained manually. Using the first set of criteria (F2+F3&gt;9), we acquire a FOOD taxonomy with 952 senses (more than two times larger than the one built manually). Using the second one (F2+F3&gt;4), we obtain another taxonomy with 1,242 senses (more than three times larger). While the 33 genus terms selected by the first set of criteria produce a taxonomic structure with only 18 top beginners, the second set, with 68 possible genus terms, produces another taxonomy with 48 top beginners. However, both final taxonomic structures are flatter than the taxonomy built manually.</Paragraph>
    <Paragraph position="3"> This is because we are restricting the inner taxonomic genus terms to those selected by the criteria (33 and 68, respectively). Consider the following taxonomic chain, obtained in a semiautomatic way by (Castellón 93): bebida_1_3 &lt;- liquido_1_6 &lt;- zumo_1_1 &lt;- vino_1_1 &lt;- rueda_1_1. As líquido -liquid- was not selected as a possible genus (by the criteria described above), the taxonomic chain for that sense is: zumo_1_1 &lt;- vino_1_1 &lt;- rueda_1_1. Thus, a few arrangements (18 or 48, depending on the criteria selected) must be made at the top level of the automatic taxonomies. Studying the main top beginners, we can easily discover an internal structure between them: for instance, placing all zumo (juice) senses within bebida (drink). [Footnote 9: We used the results reported by (Castellón 93) as a baseline because her work was done using the same Spanish dictionary.]</Paragraph>
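The chain-cutting behaviour described above, where a sense whose genus was discarded becomes a new top beginner, can be sketched as follows (toy sense identifiers, assuming a hypernym map produced by the genus-disambiguated definitions):

```python
# Hypernym links from disambiguated definitions (toy sense ids).
hypernym = {
    "rueda_1_1": "vino_1_1",
    "vino_1_1": "zumo_1_1",
    "zumo_1_1": "liquido_1_6",
    "liquido_1_6": "bebida_1_3",
}
# Genus terms that survived the selection criteria (liquido did not).
selected = {"bebida_1_3", "zumo_1_1", "vino_1_1"}

def chain(sense, hypernym, selected):
    """Hypernym chain of a sense, cut at the first genus term that the
    filtering criteria discarded: the cut creates a new top beginner."""
    out = [sense]
    g = hypernym.get(sense)
    while g is not None and g in selected:
        out.append(g)
        g = hypernym.get(g)
    return out

print(" <- ".join(reversed(chain("rueda_1_1", hypernym, selected))))
# -> zumo_1_1 <- vino_1_1 <- rueda_1_1
```

Each cut like this one adds a top beginner that must later be re-attached by hand (e.g., placing zumo under bebida), which is why the automatic taxonomies come out flatter than the manual one.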
    <Paragraph position="4"> Performing the same process for the whole dictionary we obtained for F2+F3&gt;9 a taxonomic structure of 35,099 definitions and for F2+F3&gt;4 the size grows to 40,754.</Paragraph>
  </Section>
</Paper>