<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1150">
  <Title>Hovy, E. (2001). Comparing Sets of Semantic</Title>
  <Section position="2" start_page="0" end_page="8" type="metho">
    <SectionTitle>
1 Evaluating ontologies
</SectionTitle>
    <Paragraph position="0"> Automatic methods for ontology learning and population have been proposed in recent literature (e.g. ECAI-2002 and KCAP-2003 workshops  ) but a co-related issue then becomes the evaluation of such automatically generated ontologies, not only with the goal of comparing the different approaches (Hovy, 2001) and ontology-based tools (Angele and Sure, 2002), but also to verify whether an automatic process may actually compete with the typically human process of converging on an agreed conceptualization of a given domain. Ontology construction, apart from the technical aspects of a knowledge representation task (i.e. choice of representation languages, consistency and correctness with respect to axioms, etc.), is a consensus building process, one that implies long and often harsh discussions among the specialists of a given domain. Can an automatic method simulate this process? Can we provide domain specialists with a means to measure the adequacy of a specific set of concepts as a model of a given  domain?, Specialists are often unable to evaluate the formal content of a computational ontology (e.g. the denotational theory, the formal notation, the knowledge representation system capabilities like property inheritance, consistency, etc.). Evaluation of the formal content is rather tackled by computational scientists, or by automatic verification systems. The role of the specialists is instead to compare their intuition of a domain with the description of this domain, as provided by the ontology concepts. To facilitate one such qualitative per-concept evaluation, we devised a method for automatic generation of textual explanations (glosses) of automatically learned concepts. Glosses provide a description, in natural language, of the formal specifications assigned to the learned concepts. An expert can easily compare his intuition with these natural language descriptions.</Paragraph>
    <Paragraph position="1"> The objective of the gloss-based evaluation is, as previously remarked, to obtain a judgement, by domain specialists, concerning the adequacy of an automatically derived domain conceptualisation. On the computational side, an ontology learning tool is based on a battery of software programs aimed at extracting and formalising domain knowledge, usually starting from unstructured data. Therefore, it is equally important to produce a detailed evaluation of these programs, on a quantitative ground, in order to gain insight on the internal and external contingencies that may affect the result of an ontology learning process.</Paragraph>
    <Paragraph position="2"> In what follows, we firstly provide a quantitative evaluation of the OntoLearn ontology learning system, under different learning circumstances.</Paragraph>
    <Paragraph position="3"> Secondly, we describe the gloss-based per-concept evaluation method. Both evaluation strategies are experimented in two application domains: Tourism and Economy.</Paragraph>
    <Paragraph position="4"> The subsequent section provides a sketchy description of the OntoLearn algorithms. Details are found in (Navigli and Velardi, 2004) and (Navigli, Velardi and Gangemi, 2003). Sections 3 and 4 are dedicated to the quantitative and qualitative analyses of OntoLearn.</Paragraph>
    <Paragraph position="5"> 2 Summary of the OntoLearn system OntoLearn is an ontology population method based on text mining and machine learning techniques. OntoLearn starts with an existing generic ontology (we use WordNet, though other choices are possible) and a set of documents in a given domain, and produces a domain extended and trimmed version of the initial ontology. The ontology generated by OntoLearn is anchored to texts, it can be therefore classified as a linguistic ontology (Gomez-Perez et al. 2004).</Paragraph>
    <Paragraph position="6"> OntoLearn has been applied to different domains (tourism, computer networks, economy) and in several European projects  .</Paragraph>
    <Paragraph position="7"> Concept learning is achieved in the following three phases:</Paragraph>
    <Paragraph position="9"> extracted from a set of documents that are judged representative of a given domain.</Paragraph>
    <Paragraph position="10"> MWEs are extracted using natural language processing and statistical techniques.</Paragraph>
    <Paragraph position="11"> Contrastive corpora and glossaries in different domains are used to prune terminology that is not domain-specific. Domain MWEs are selected also on the basis of an entropy-based measure that simulates specialist consensus on concepts choice: in words, the probability distribution of a &amp;quot;good&amp;quot; domain MWE must be uniform across the individual documents of the domain corpus.</Paragraph>
    <Paragraph position="12"> 2) Semantic interpretation of MWEs: Semantic interpretation is based on a principle, compositional interpretation, and on a novel algorithm, called structural semantic interconnections (SSI). Compositional interpretation signifies that the meaning of a multi-word expression (MWE) can be derived compositionally from its components  , e.g. the meaning of business plan is derived first, by associating the appropriate concept identifier, with reference to the initial top ontology, to the component terms (i.e. sense #2 of business and sense #1 of plan in WordNet), and then, by identifying the semantic relations holding among the involved concepts (e.g.</Paragraph>
    <Paragraph position="13">  E.g. : Harmonize IST-2000-29329 and the INTEROP network of excellence, started on december 2003.</Paragraph>
    <Paragraph position="14">  In the literature, multi word expressions are classified as compositional, idiosyncratically compositional and noncompositional. In mid-technical domains, compositional MWEs cover about 60-70% of MWE (we cannot support with data our statitics for sake of space) plan#1  topic - business# 2 ).</Paragraph>
    <Paragraph position="15"> 3) Extending and trimming the initial ontology: Once the terms have been semantically interpreted, they are organized in sub-trees, and appended under the appropriate node of the initial ontology, e.g. business _ plan# 1 kind _ of - plan# 1 .</Paragraph>
    <Paragraph position="16">  Furthermore, certain upper and lower nodes of the initial ontology are pruned to create a domain-view of the ontology. The final ontology is output in OWL language.</Paragraph>
    <Paragraph position="17"> SSI lies in the area of syntactic pattern matching algorithms (Bunke and Sanfeliu, 1990). It is a word sense disambiguation algorithm used to determine the correct sense (with reference to the initial ontology) for each component of a complex MWE. The algorithm is based on building a graph representation for alternative senses of each MWE component  , and then selecting the appropriate senses on the basis of detected semantic interconnection patterns between graph pairs. The SSI algorithm seeks for semantic interconnections among the words of a context T. Contexts T i are generated from groups of partially overlapping complex MWEs (extracted during phase 1 of the OntoLearn procedure) sharing the same syntactic head. For example, given the list of complex MWEs securities portfolio, investment portfolio, real-estate portfolio, junk-bond portfolio, diversified portfolio, stock portfolio, bond portfolio, loan portfolio, the following list of term components is created: T=[security, investment, real-estate, estate, bond, junk-bond, diversified, stock, portfolio, loan ] Relevant pattern types are described by a context free grammar G. An example of rule in G is the following (S  For instance, in railways company, the gloss of railway#1 contains the word organization, and there is an hyperonymy path of length 2 between company#1 and organization#1. That is:</Paragraph>
    <Paragraph position="19"> organization#1. This pattern (an instance of the gloss+hypeonymyr/meronymy rule) cumulates  We remark again that a detailed description of the SSI algorithm is in (Navigli &amp; Velardi, 2004) and (Navigli, Velardi and Gangemi, 2003). Graphs are generated on the basis of lexico-semantic information in WordNet and in a variety of on-line resources, see the mentioned papers for details.</Paragraph>
    <Paragraph position="20"> evidence for senses #1 of both railway and company.</Paragraph>
    <Paragraph position="21"> In SSI, the correct sense S</Paragraph>
    <Paragraph position="23"> for a term t[?]T is selected depending upon the number and weight of patterns matching with rules in G. The weights of patterns are automatically learned using a perceptron  model. The weight function is given by:</Paragraph>
    <Paragraph position="25"> is the weight of rule j in G, and the second addend is a smoothing parameter inversely proportional to the length of the matching pattern (e.g. 2 in the previous example, since 2 is the minimal length of the rule, and the actual length of the pattern is 3). The perceptron has been trained on the SemCor  semantically annotated corpus.</Paragraph>
    <Paragraph position="26"> In order to complete the semantic interpretation process, OntoLearn then attempts to determine the semantic relations that hold between the components of a complex concept. In order to do this, it was first necessary to select an inventory of semantic relations. We examined several proposals, like EuroWordnet (Vossen, 1999), DOLCE (Masolo et al., 2002), FrameNet (Ruppenhofer Fillmore &amp; Baker, 2002) and others. As also remarked in (Hovy, 2001), no systematic methods are available in literature to compare the different sets of relations. Since our objective was to define an automatic method for semantic relation extraction, our final choice was to use a reduced set of FrameNet relations, which seemed general enough to cover our application domains (tourism, computer networks, economy).</Paragraph>
    <Paragraph position="27"> The choice of FrameNet is motivated by the availability of a sufficiently large set of annotated examples of conceptual relations  , that we used to train an available machine learning algorithm, TiMBL (Daelemans et al., 2002). The relations used are: Material, Purpose, Use, Topic, Product,  . Examples for each relation are the following:  The choice of FrameNet was motivated more by availability than appropriateness.</Paragraph>
    <Paragraph position="28">  The relation Attribute is not in FrameNet, however it was a useful relation for terminological strings of the adjective_noun type.</Paragraph>
    <Paragraph position="30"> - com pany# 1 We represented training instances as pairs of concepts annotated with the appropriate conceptual relation, e.g.: [(computer#1,maker#3),Product] Each concept is in turn represented by a feature-vector where attributes are the concept's hyperonyms in WordNet, e.g.:</Paragraph>
  </Section>
  <Section position="3" start_page="8" end_page="12" type="metho">
    <SectionTitle>
3 Quantitative Evaluation of OntoLearn
</SectionTitle>
    <Paragraph position="0"> This section provides a quantitative evaluation of OntoLearns main algorithms. We believe that a quantitative evaluation is particularly important in complex learning systems, where errors can be produced at almost any stage. Even though some of these errors (e.g. subtle sense distinctions) may not have a percievable effect on the final ontology, as shown by the results of the qualitative evaluation in Section 4.2, it is nevertheless important to gain insight on the actual system capabilities, as well as on the pararmeters and external circumstances that may positively or negatively influence the final performance.</Paragraph>
    <Section position="1" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
3.1 Evaluating the MWE extraction algorithm
</SectionTitle>
      <Paragraph position="0"> algorithm The terminology extraction algorithm has been evaluated in the context of the European project Harmonise on Tourism interoperability. We first collected a corpus of about 1 million words of tourism documents, mainly descriptions of travel and tourism sites. From this corpus, a syntactic parser extracted an initial list of 14,383 candidate complex MWEs from which the statistical filters selected a list of 3,840 domain-relevant complex MWEs, that were submitted to the domain specialists. The Harmonise ontology partners were not skilled to evaluate the OntoLearn semantic interpretation of MWEs, therefore we let them evaluate only the domain appropriateness of the terms. The gloss generation method described in Section 4 was subsequently concieved to overcome this limitation.</Paragraph>
      <Paragraph position="1"> We obtained a precision ranging from 72.9% to about 80% and a recall of 52.74%. The precision shift is due to the well-known fact that experts may have different intuitions about the relevance of a concept. The recall estimate was produced by manually inspecting 6,000 of the initial 14,383 candidate MWEs, asking the experts to mark all the MWEs judged as &amp;quot;good&amp;quot; domain MWEs, and comparing the obtained list with the list of terms automatically filtered by OntoLearn.</Paragraph>
      <Paragraph position="2"> We ran similar experiments on an Economy corpus and a Computer Network corpus, but in this case the evaluation was performed by the authors. Overall, the performance of the MWE extraction task appears to be influenced by the dimension and the focus of the starting corpus (e.g. &amp;quot;generic tourism&amp;quot; vs. &amp;quot;hotel accomodation descriptions&amp;quot;). Small and unfocused corpora do not favor the efficacy of statistical analysis. However, the availability of sufficiently large and focused corpora seems a realistic requirement for most applications.</Paragraph>
    </Section>
    <Section position="2" start_page="8" end_page="10" type="sub_section">
      <SectionTitle>
3.2 Evaluating the ontology learning
algorithms
</SectionTitle>
      <Paragraph position="0"> The distinctive task performed by OntoLearn is semantic disambiguation. The performance of the SSI algorithm critically depends upon two factors: the first is the ability to detect semantic interrelations among concepts associated to the words of complex MWEs, the second is the dimension of the context T available to start the disambiguation process.</Paragraph>
      <Paragraph position="1"> As for the first factor, there are two possible ways of enhancing reliable identification of semantic interconnections: one is to tune at best the weight of individual rules in G (e.g. formula (1) in Section 2), the second is to enrich the semantic information associated to alternative word senses. The latter is an on-going research activity.</Paragraph>
      <Paragraph position="2"> As far as the context T is concerned, the intuition is that, with a larger T , there are higher chances of detecting semantic patterns among the &amp;quot;correct&amp;quot; senses of the terms in T. However, the</Paragraph>
      <Paragraph position="4"> is an external contingency, it depends upon the available corpus. Accordingly, we evaluated the SSI algorithm using as parameters the dimension of T, T , and the weights associated to rules in G. We ran several experiments over the full terminology extracted from the Economy and Tourism corpora, but performances are computed only on, respectively, 453 and 638 manually disambiguated terms. This means that in a context T</Paragraph>
      <Paragraph position="6"> including, e.g. k terms, we evaluate OntoLearn's sense choices only for the fragment of j [?] k terms, for which the &amp;quot;true&amp;quot; sense has been manually assigned.</Paragraph>
      <Paragraph position="7"> Table 1 shows the performance of SSI (precision and recall) when using only patterns whose weight, computed with formula (1) is over a threshold th . The &amp;quot;Core&amp;quot; column in Table 1 shows the performance of SSI when accepting only these core patterns, while the third column refers to all matching patterns. With th=0,7 a subset of 7-9 rules  in G (over a total of 20) are used by the algorithm. Interestingly enough, these rules have a high probability of being hired, as shown by the relatively low difference in recall. The Baseline tower in Table 1 is computed selecting always the first sense (senses in WordNet are ordered by probability in everyday language).</Paragraph>
      <Paragraph position="8"> Table 2 shows that performance of SSI is indeed affected by the dimension of T. Large T , as expected, improves the performance, however, overly large contexts (&gt;80 terms) may favor the detection of non-relevant patterns.</Paragraph>
      <Paragraph position="9"> In general, both experiments show that the Economy corpus performs better than the Tourism, since the latter is less technical (the baseline is quite high), rather unfocused, and contexts T</Paragraph>
      <Paragraph position="11"> We remark that SSI performs better than standard WSD (word sense disambiguation) tasks but this is also motivated by the fact that context words in T are more interrelated than co-occurring words in generic sentences. The SSI algorithm, by  , gloss disambiguation exercise, placed about 1% below the first and about 11% before the third participant.</Paragraph>
    </Section>
    <Section position="3" start_page="10" end_page="12" type="sub_section">
      <SectionTitle>
3.3 Evaluating the semantic annotation algorithm
</SectionTitle>
      <Paragraph position="0"> algorithm To test the semantic relation annotation task, we used a learning set (including selected annotated examples from FrameNet (FN), Tourism (Tour), and Economy (Econ)), and a test set with a distribution of examples shown in Table 3.</Paragraph>
      <Paragraph position="1">  Notice that the relation Attribute is generated whenever the term associated to one of the concepts is an adjective. Therefore, this semantic relation is not included in the evaluation experiment, since it would artificially increase performances. We then tested the learner on test sets for individual domains  , leading to the results shown in Table 4 a and b.</Paragraph>
      <Paragraph position="2">  . The parameter d in the above Tables is a confidence factor defined in the TiMBL algorithm. This parameter can be used to  This of course penalised the results (the performance over a test set composed by examples of all the three domains is much higher), but provides a more realistic test bed of the generality of the approach.  http://trec.nist.gov/ increase system's robustness in the following way: whenever the confidence associated by TiMBL to the classification of a new instance is lower than a given threshold, we output a &amp;quot;generic&amp;quot; conceptual relation, named Relatedness. We experimentally fixed the threshold for d around 30% (central column of Table 4).</Paragraph>
      <Paragraph position="3"> Table 4 demonstrates rather good performances, however the main problem with semantic relation annotation is the unavailability of an agreed set of conceptual relations, and a sufficiently large and balanced training set. Consequently, we need to update the set of used relations whenever we analyse a new domain, and re-run the training phase enriching the training corpus with manually tagged examples from the new domain (as for in</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="12" end_page="14" type="metho">
    <SectionTitle>
4 Qualitative evaluation: Evaluating the generated ontology on a per-concept basis
</SectionTitle>
    <Paragraph position="0"> generated ontology on a per-concept basis The lesson learned during the Harmonise EC project was that the domain specialists, tourism operators in our case, can hardly evaluate the formal aspects of a computational ontology. When presented with the domain extended and trimmed version of WordNet (OntoLearn's phase 3 in Section 2), they were only able to express a generic judgment on each node of the hierarchy, based on the concept label. These judgments were used to evaluate the terminology extraction task, but the experiment suggested that, indeed, it was necessary to provide a better description for the learned concepts.</Paragraph>
    <Section position="1" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
4.1 Gloss generation grammar
</SectionTitle>
      <Paragraph position="0"> To help human evaluation on a per-concept basis, we decided to enhance OntoLearn with a gloss generation algorithm. The idea is to generate glosses in a way that closely reflects the key aspects of the concept learning process, i.e.</Paragraph>
      <Paragraph position="1"> semantic disambiguation and annotation with a conceptual relation.</Paragraph>
      <Paragraph position="2"> The gloss generation algorithm is based on the definition of a grammar with distinct generation rules for each type of semantic relation.</Paragraph>
      <Paragraph position="4"> be the complex concept associated to a complex term w</Paragraph>
      <Paragraph position="6"> (e.g. jazz festival, or long-term debt), and let: &lt;H&gt;= the syntactic head of w  of the WordNet gloss of &lt;HYP&gt; &lt;MSGM&gt;= the main sentence of the WordNet gloss of the selected sense for &lt;M&gt; Here we provide two examples of rules for generating GNCs: If sem_rel=Topic, &lt;GNC&gt;:: = a kind of &lt;HYP&gt;, &lt;MSGHYP&gt;, relating to the &lt;M&gt;, &lt;MSGM&gt;. e,g.: GNC(jazz festival): a kind of festival, a day or period of time set aside for feasting and celebration, relating to the jazz, a style of dance music popular in the 1920.</Paragraph>
      <Paragraph position="7"> If sem_rel=Attribute, &lt;GNC&gt;:= a kind of &lt;HYP&gt;, &lt;MSGHYP&gt;, &lt;MSGM&gt;.</Paragraph>
      <Paragraph position="8"> e.g.:GNC(long term debt)= a kind of debt, the state of owing something (especially money), relating to or extending over a relatively long time.</Paragraph>
    </Section>
    <Section position="2" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
4.2 Per-concept evaluation experiment
</SectionTitle>
      <Paragraph position="0"> To verify the utility of gloss generation, the automatically generated glosses were submitted for evaluation to two human experts, a tourism specialist from ECCA  , and an economist from the University of Ancona. The specialists were not aware of the method used to generate glosses; they have been presented with a list of concept-gloss pairs and asked to fill in an evaluation form (see Appendix) as follows: vote 1 means &amp;quot;unsatisfactory definition&amp;quot;, vote 2 means &amp;quot;the definition is helpful&amp;quot;, vote 3 means &amp;quot;the definition is fully acceptable&amp;quot;. Whenever he was not fully happy with a definition (vote 2 or 1), the specialist was asked to provide a brief explanation. For comparison, Appendix 2 shows also glossary definitions extracted from the web for the same MWEs, that were not shown to the specialists.</Paragraph>
      <Paragraph position="1"> Table 5 provides a summary of the evaluation..</Paragraph>
      <Paragraph position="2">  The following conclusions can be drawn from this experiment: 1. Overall, the two domain specialists fully accepted the system's choices in 45-49% of the cases, and were reasonably satisfied in 12-14%  The main sentence is the gloss pruned of subordinates, examples, etc.</Paragraph>
      <Paragraph position="3">  ECCA - eTourism Competence Center Austria. of the cases. The average vote is above 2 in both cases.</Paragraph>
      <Paragraph position="4"> 2. As expected, if a MWE is compositional, the generated definition is more often accepted or fully accepted (e.g. examples 25_E and 14_T in Appendix 2). When a compositional interpretation is not accepted (vote=1), this is motivated either by an OntoLearn interpretation error (wrong sense or wrong conceptual relations) or by the unavailability of a correct sense in WordNet, despite the fact that the sense is not idiosyncratic. OntoLearn errors for compositional MWEs are 7 (5%) in Economy and 12 (13%) in Tourism. Examples of OntoLearn errors and core ontology &amp;quot;misses&amp;quot; are the definitions 14_T (wrong sense of form) and 19_E (no good sense for bilateral in WordNet), respectively.</Paragraph>
      <Paragraph position="5"> 3. Sometimes the specialists found it acceptable also an idiosyncratic or non compositional definition. This happens in 16 cases for the Tourism domain (16%) and in 19 cases for the Economy domain (13%). Examples are the MWEs 45_E and 76_E, both idiosyncratically decomposable, in Appendix 2.</Paragraph>
      <Paragraph position="6"> One of the specialists is particularly involved in ontology building projects, therefore we report his valuable comment: &amp;quot;some of the descriptions would not be appropriate to take them over in a tourism ontology just as they are. But most of them are quite helpful as basis for building the ontology. The most important problem from my point of view is the too detailed descriptions of the components itself instead of the meaning of the overall term in this context. Best example is the term &amp;quot;bed tax&amp;quot;. Nobody would expect a definition of a bed or a tax.&amp;quot; In other terms, he found disturbing the fact that a definition extensively reports the definitions of its components. On the other side, our objective is not only to produce concept definitions, but also to organize concepts in hierarchies. Showing the definitions of individual components is a &amp;quot;natural&amp;quot; mean to verify that the correct senses have been selected (e.g. the correct senses of bed and tax). This is clearly the case, since, for example in definition 14_T (booking form), the specialist was immediately able to diagnose a sense disambiguation error for form, though he was unaware of the OntoLearn methodology.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="14" end_page="14" type="metho">
    <SectionTitle>
5 Concluding remarks
</SectionTitle>
    <Paragraph position="0"> This paper presented an in-depth evaluation of the Ontolearn ontology learning system. The three basic algorithms (terminology extraction, sense disambiguation and annotation with semantic relation) have been individually evaluated in two domains, under different parametrizations, to obtain a realistic and comprehensible picture of system's capabilities. The critical algorithm, SSI, has very good performances that are favored by the fact that word sense disambiguation is applied to group of words (domain MWEs) that are strongly semantically related, unlike for generic WSD tasks (e.g. Senseval). The performance of the SSI algorithm can be further improved through an extension of the grammar G, which is an on-going research activity.</Paragraph>
  </Section>
class="xml-element"></Paper>