<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0508"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A hybrid approach for extracting semantic relations from texts</Title> <Section position="5" start_page="57" end_page="60" type="metho"> <SectionTitle> 3 A hybrid approach for relation extraction </SectionTitle> <Paragraph position="0"> The proposed approach for relation extraction is illustrated in Figure 1. It employs knowledge-based and (supervised and unsupervised) corpus-based techniques. The core strategy consists of mapping linguistic components with some syntactic relationship (a linguistic triple) into their corresponding semantic components. This includes mapping not only the relations, but also the terms linked by those relations. The detection of the linguistic triples involves a series of linguistic processing steps. The mapping between terms and concepts is guided by a domain ontology and a named entity recognition system. The identification of the relations relies on the knowledge available in the domain ontology and in a lexical database, and on pattern-based classification and sense disambiguation models.</Paragraph> <Paragraph position="1"> The main goal of this approach is to provide rich semantic annotations for the Semantic Web.</Paragraph> <Paragraph position="2"> Other potential applications include: 1) Ontology population: terms are mapped into new instances of concepts of an ontology, and relations between them are identified, according to the possible relations in that ontology. 3) Ontology learning: new relations between existing concepts are identified, and can be used as a first step to extend an existing ontology. 
A subsequent step to lift relations between instances to an adequate level of abstraction may be necessary.</Paragraph> <Section position="1" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 3.1 Context and resources </SectionTitle> <Paragraph position="0"> The input to our experiments consists of electronic Newsletter Texts. These are short texts describing news of various kinds related to members of a research group: projects, publications, events, awards, etc. The domain Ontology used (KMi-basic-portal-ontology) was designed based on the AKT reference ontology to include concepts relevant to our domain. The instantiations of concepts in this ontology are stored in the knowledge base (KB) KMi-basic-portal-kb.</Paragraph> <Paragraph position="1"> The other two resources used in our architecture are the lexical database WordNet (Fellbaum, 1998) and a repository of Patterns of relations, described in Section 3.4.</Paragraph> </Section> <Section position="2" start_page="58" end_page="59" type="sub_section"> <SectionTitle> 3.2 Identifying linguistic triples </SectionTitle> <Paragraph position="0"> Given a newsletter text, the first step of the relation extraction approach is to process the natural language text in order to identify linguistic triples, that is, sets of three elements with a syntactic relationship, which can indicate potentially relevant semantic relations. In our architecture, this is accomplished by the Linguistic Component module, an adaptation of the linguistic component designed in Aqualog (Lopez et al., 2005), a question answering system.</Paragraph> <Paragraph position="1"> The linguistic component uses the infrastructure and the following resources from GATE (Cunningham et al., 2002): tokenizer, sentence splitter, part-of-speech tagger, morphological analyzer and VP chunker. 
On top of these resources, which produce syntactic annotations for the input text, the linguistic component uses a grammar to identify linguistic triples. This grammar was implemented in Jape (Cunningham et al., 2000), which allows the definition of patterns to recognize regular expressions using the annotations provided by GATE.</Paragraph> <Paragraph position="2"> The main type of construction that our grammar aims to identify involves a verbal expression as indicative of a potential relation and two noun phrases as terms linked by that relation. However, our patterns also account for other types of constructions, including, e.g., the use of a comma to implicitly indicate a relation, as in sentence (1). In this case, when mapping the terms into entities (Section 3.3), having identified that &quot;KMi&quot; is an organization and &quot;Enrico Motta&quot; is a person, it is possible to guess the relation indicated by the comma (e.g., work). Some example triples identified by our patterns for the newsletter in Figure 2 are given in Figure 3.</Paragraph> <Paragraph position="3"> (1) &quot;Enrico Motta, at KMi now, is leading a project on ....&quot;.</Paragraph> <Paragraph position="4"> Jape patterns are based on shallow syntactic information only, and therefore they are not able to capture certain potentially relevant triples. To overcome this limitation, we employ a parser as a complementary resource to produce linguistic triples. We use Minipar (Lin, 1993), which produces functional relations for the components in a sentence, including subject and object relations with respect to a verb. 
This allows capturing some implicit relations, such as indirect objects and long-distance dependency relations.</Paragraph> <Paragraph position="5"> Minipar's representation is converted into a triple format and therefore the intermediate representation provided by both GATE and Minipar consists of triples of the type: &lt;noun_phrase, verbal_expression, noun_phrase&gt;.</Paragraph> </Section> <Section position="3" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.3 Identifying entities and relations </SectionTitle> <Paragraph position="0"> Given a linguistic triple, the next step is to verify whether the verbal expression in that triple conveys a relevant semantic relationship between entities (given by the terms) potentially belonging to an ontology. This is the most important phase of our approach and is represented by a series of modules in our architecture in Figure 1.</Paragraph> <Paragraph position="1"> As a first step we try to map the linguistic triple into an ontology triple, by using an adaptation of</Paragraph> </Section> <Section position="4" start_page="59" end_page="60" type="sub_section"> <SectionTitle> Aqualog's Relation Similarity Service (RSS). </SectionTitle> <Paragraph position="0"> RSS tries to make sense of the linguistic triple by looking at the structure of the domain ontology and the information stored in the KB. In order to map a linguistic triple into an ontology triple, besides looking for an exact matching between the components of the two triples, RSS considers partial matching by using a set of resources in order to account for minor lexical or conceptual discrepancies between these two elements. These resources include metrics for string similarity matching, synonym relations given by WordNet, and a lexicon of previous mappings between the two types of triples. 
Different strategies are employed to identify a matching for terms and relations, as we describe below.</Paragraph> <Paragraph position="1"> Since we do not consider any interaction with the user in order to achieve a fully automated annotation process, other modules were developed to complete the mapping process even if there is no matching (Section 3.4) or if there is ambiguity (Section 3.5), according to RSS.</Paragraph> <Paragraph position="2"> Strategies for mapping terms To map terms into entities, the following attempts are made (in the given order): 1) Search the KB for an exact matching of the term with any instance.</Paragraph> <Paragraph position="3"> 2) Apply string similarity metrics to calculate the similarity between the given term and each instance of the KB. A hybrid scheme combining three metrics is used: Jaro-Winkler, jlevelDistance and wlevelDistance. Different combinations of threshold values for the metrics are considered. The elements in the linguistic triples are lemmatized in order to avoid problems which could be incorrectly handled by the string similarity metrics (e.g., past tense).</Paragraph> <Paragraph position="4"> 2.1) If there is more than one possible matching, check whether the term is a substring of any of them. For example, the term &quot;Motta&quot; is a substring of the instance name &quot;Enrico Motta&quot;, and thus that instance should be preferred. For example, the similarity values returned for the term &quot;vanessa&quot; with instances potentially relevant for the mapping are given in Figure 4.</Paragraph> <Paragraph position="5"> The combination of thresholds is met for the instance &quot;Vanessa Lopez&quot;, and thus the mapping is accomplished. 
If there is still more than one possible mapping, we assume there is not enough evidence to map that term and discard the triple.</Paragraph> <Paragraph position="6"> Strategies for mapping relations In order to map the verbal expression into a conceptual relation, we assume that the terms of the triple have already been mapped either into instances of classes in the KB by RSS, or into potential new instances, by a named entity recognition system (as we explain in the next section). The following attempts are then made for the verb-relation mapping: 1) Search the KB for an exact matching of the verbal expression with any existent relation for the instances under consideration or any possible relation between the classes (and superclasses) of the instances under consideration.</Paragraph> <Paragraph position="7"> 2) Apply the string similarity metrics to calculate the similarity between the given verbal expression and the possible relations between instances (or their classes) corresponding to the terms in the linguistic triple.</Paragraph> <Paragraph position="8"> 3) Search for similar mappings for the types/classes of entities under consideration in a lexicon of mappings automatically created according to users' choices in the question answering system Aqualog. This lexicon contains ontology triples along with the original verbal expression, as illustrated in Table 1. The use of this lexicon represents a simplified form of pattern matching in which only exact matching is considered. 
Table 1 (example lexicon entry): given_relation: works; class_1: project; conceptual relation: has-project-member; class_2: person.</Paragraph> </Section> </Section> <Section position="6" start_page="60" end_page="62" type="metho"> <SectionTitle> 4) Search for synonyms of the given verbal </SectionTitle> <Paragraph position="0"> expression in WordNet, in order to verify if there is a synonym that matches (completely or partially, using string similarity metrics) any existing relation for the instances under consideration, or any possible relation between the classes (or superclasses) of those instances (as in step 1).</Paragraph> <Paragraph position="1"> If there is no possible mapping for the term, the pattern-based classification model is triggered (Section 3.4). Conversely, if there is more than one possible mapping, the disambiguation model is called (Section 3.5).</Paragraph> <Paragraph position="2"> The application of these strategies to map the linguistic triples into existing or new instances and relations is described in what follows.</Paragraph> <Paragraph position="3"> Applying RSS to map entities and relations In our architecture, RSS is represented by modules RSS_1 and RSS_2. RSS_1 first checks if the terms in the linguistic triple are instances of a KB (cf. strategies for mapping terms). If the terms can be mapped to instances, it checks whether the relation given in the triple matches any already existing relation between those instances, or, alternatively, if that relation matches any of the possible relations for the classes (and superclasses) of the two instances in the domain ontology (cf. strategies for mapping relations). Three situations may arise from this attempt to map the linguistic triple into an ontology triple (Cases (1), (2), and (3) in Fig. 
1): Case (1): complete matching with instances of the KB and a relation of the KB or ontology, with possibly more than one valid conceptual relation being identified: &lt;instance1, (conceptual_relation)+, instance2&gt;.</Paragraph> <Paragraph position="4"> Case (2): no matching or partial matching with instances of the ontology (the relation is not analyzed (na) when there is no matching for instances): &lt;instance1, na, ?&gt; or &lt;?, na, instance2&gt; or &lt;?, na, ?&gt;. Case (3): matching with instances of the KB, but no matching with a relation of the KB or ontology: &lt;instance1, ?, instance2&gt;. If the matching attempt results in Case (1) with only one conceptual relation, then the triple can be formalized into a semantic annotation. This yields the annotation of an already existing relation for two instances of the KB, as well as a new relation for two instances of the KB, although this relation was already predicted in the ontology as possible between the classes of those instances. The generalization of the produced triple for classes/types of entities, i.e., &lt;class, conceptual_relation, class&gt;, is added to the repository of Patterns.</Paragraph> <Paragraph position="5"> On the other hand, if there is more than one possible conceptual relation in Case (1), the system tries to find the correct one by means of a sense disambiguation model, described in Section 3.5. Conversely, if there is no matching for the relation (Case (3)), the system tries an alternative strategy: the pattern-based classification model (Section 3.4). Finally, if there is no complete matching of the terms with instances of the KB (Case (2)), it means that the entities can be new to the KB.</Paragraph> <Paragraph position="6"> In order to check if the terms in the linguistic triple express new entities, the system first identifies
to what classes of the ontology they belong. This is accomplished by means of ESpotter++, an extension of the named entity recognition system ESpotter (Zhu et al., 2005).</Paragraph> <Paragraph position="7"> ESpotter is based on a mixture of lexicons (gazetteers) and patterns. We extended ESpotter by including new entities (extracted from other gazetteers), a few relevant new types of entities, and a small set of efficient patterns. All types of entities correspond to generic classes of our domain ontology, including: person, organization, event, publication, location, project, researcharea, technology, etc.</Paragraph> <Paragraph position="8"> In our architecture, if ESpotter++ is not able to identify the types of the entities, the process is aborted and no annotation is produced. This may be either because the terms do not have any conceptual mapping (for example &quot;it&quot;), or because the conceptual mapping is not part of our domain ontology. Otherwise, if ESpotter++ succeeds, RSS is triggered again (RSS_2) in order to verify whether the verbal expression encompasses a semantic relation. Since at least one of the two entities is recognized by ESpotter++, and therefore at least one entity is new, it is only possible to check if the relation matches the possible relations between the classes of the recognized entities (cf. strategies for mapping relations).</Paragraph> <Paragraph position="9"> If the matching attempt results in only one conceptual relation, then the triple will be formalized into a semantic annotation. This represents the annotation of a new (although predicted) relation and one or two new entities/instances. The produced triple of the type &lt;class, conceptual_relation, class&gt; is added to the repository of Patterns.</Paragraph> <Paragraph position="10"> Again, if there are multiple valid conceptual relations, the system tries to find the correct one by means of a disambiguation model (Section 3.5). 
Conversely, if there is no matching for the relation, the pattern-based classification model is triggered (Section 3.4).</Paragraph> <Section position="1" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 3.4 Identifying new relations </SectionTitle> <Paragraph position="0"> The process described in Section 3.3 for the identification of relations accounts only for the relations already predicted as possible in the domain ontology. However, we are also interested in the additional information that can be provided by the text, in the form of new types of relations for known or new entities. In order to discover these relations, we employ a pattern matching strategy to identify relevant relations between types of terms.</Paragraph> <Paragraph position="1"> The pattern matching strategy has proved to be an efficient way to extract semantic relations, but in general has the drawback of requiring the possible relations to be previously defined. In order to overcome this limitation, we employ a Pattern-based classification model that can identify similar patterns based on a very small initial number of patterns.</Paragraph> <Paragraph position="2"> We consider patterns of relations between types of entities, instead of the entities themselves, since we believe that it would be impossible to accurately judge the similarity for the kinds of entities we are addressing (names of people, locations, etc.). Thus, our patterns consist of triples of the type &lt;class, conceptual_relation, class&gt;, which are compared against a given triple using its classes (already provided by the linguistic component or by ESpotter++) in order to classify relations in that triple as relevant or non-relevant. The classification model is based on the approach presented in (Stevenson, 2004). 
It is an unsupervised corpus-based module which takes as examples a small set of relevant SVO patterns, called seed patterns, and uses a WordNet-based semantic similarity measure to compare the pattern to be classified against the relevant ones.</Paragraph> <Paragraph position="3"> Our initial seed patterns (see examples in Table 2) mix patterns extracted from the lexicon generated by Aqualog's users (cf. Section 3.3) and a small number of manually defined relevant patterns. This set of patterns is expected to be enriched with new patterns as our system annotates relevant relations, since the system adds new triples to the initial set of patterns.</Paragraph> <Paragraph position="4"> Table 2 (class_1, conceptual relation, class_2): (project, has-project-member, person); (project, has-publication, publication); (person, develop, technology); (person, attend, event). Like Stevenson (2004), we use a semantic similarity metric based on the information content of the words in the WordNet hierarchy, derived from corpus probabilities. It scores the similarity between two patterns by computing the similarity for each pair of words in those patterns. A threshold of 0.90 for this score was used here to classify two patterns as similar. In that case, a new annotation is produced for the input triple and it is added to the set of patterns.</Paragraph> <Paragraph position="5"> It is important to notice that, although WordNet is also used in the RSS module, in that case only synonyms are checked, while here the similarity metric explores deeper information in WordNet, considering the meaning (senses) of the words. It is also important to distinguish the semantic similarity metrics employed here from the string metrics used in RSS. 
String similarity metrics simply try to capture minor variations on the strings representing terms/relations; they do not account for the meaning of those strings.</Paragraph> </Section> <Section position="2" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 3.5 Disambiguating relations </SectionTitle> <Paragraph position="0"> The ambiguity arising when more than one possible relation exists for a pair of entities is a problem neglected in most of the current work on relation extraction. In our architecture, when the RSS finds more than one possible relation, we choose one relation by using the word sense disambiguation (WSD) system SenseLearner (Mihalcea and Csomai, 2005).</Paragraph> <Paragraph position="1"> SenseLearner is a supervised WSD system that disambiguates all open-class words in any given text, after being trained on a small data set, according to global models for word categories.</Paragraph> <Paragraph position="2"> The current distribution includes two default models for verbs, which were trained on a corpus containing 200,000 content words of journalistic texts tagged with their WordNet senses. Since SenseLearner requires a sense-tagged corpus in order to be trained to specific domains and there is no such corpus for our domain, we use one of the default training models. This is a contextual model that relies on the first word before and after the verb, and their POS tags. To disambiguate new cases, it requires only that the words are annotated with POS tags. The use of lemmas of the words instead of the words yields better results, since the models were generated for lemmas. In our architecture, these annotations are produced by the component POS + Lemmatizer.</Paragraph> <Paragraph position="3"> Since the WSD module disambiguates among WordNet senses, it is employed only after the use of the WordNet subcomponent by RSS. 
This subcomponent finds all the synonyms for the verb in a linguistic triple and checks which of them match existing or possible relations for the terms in that triple. In some cases, however, there is a matching for more than one synonym.</Paragraph> <Paragraph position="4"> Since in WordNet synonyms usually represent different uses of the verb, the WSD module can identify in which sense the verb is being used in the sentence, allowing the system to choose one among all the matching options.</Paragraph> <Paragraph position="5"> For example, given the linguistic triple &lt;enrico_motta, head, kmi&gt;, RSS is able to identify that &quot;enrico_motta&quot; is a person, and that &quot;kmi&quot; is an organization. However, it cannot find an exact or partial matching (using string metrics), or even a matching (given by the user lexicon) for the relation &quot;head&quot;. After getting all its synonyms in WordNet, RSS verifies that two of them match possible relations in the ontology between a person and an organization: &quot;direct&quot; and &quot;lead&quot;. In this case, the WSD module disambiguates the sense of &quot;head&quot; as &quot;direct&quot;.</Paragraph> </Section> <Section position="3" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 3.6 Example of extracted relations </SectionTitle> <Paragraph position="0"> As an example of the relations that can be extracted in our approach, consider the representation of the entity &quot;Enrico Motta&quot; and all the relations involving this entity in Figure 5. The relations were extracted from the text in Figure 6.</Paragraph> <Paragraph position="1"> In this case, &quot;Enrico-Motta&quot; is an instance of kmi-academic-staff-member, a subclass of person in the domain ontology. The mapped relation &quot;works-in&quot; &quot;knowledge-media-institute&quot; already existed in the KB. 
The new relations pointed out by our approach are the ones referring to the award received from the &quot;European Commission&quot; (an organization, here), for three projects: &quot;NeOn&quot;, &quot;XMEDIA&quot;, and &quot;OK&quot;.</Paragraph> </Section> </Section> </Paper>