<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1303"> <Title>Using Domain-Specific Verbs for Term Classification</Title> <Section position="6" start_page="4" end_page="4" type="evalu"> <SectionTitle> 5 Experiments and Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Resources </SectionTitle> <Paragraph position="0"> The resources used for the experiments include an ontology and a corpus, both belonging to the domain of biomedicine. The ontology is part of the UMLS (Unified Medical Language System) knowledge sources (UMLS, 2002).</Paragraph> <Paragraph position="1"> UMLS integrates biomedical information from a variety of sources and is regularly updated.</Paragraph> <Paragraph position="2"> Knowledge sources maintained under the UMLS project include: METATHESAURUS, linking term variants referring to the same concepts; SPECIALIST LEXICON, providing syntactic information for terms, their component words, and general English words; and SEMANTIC NETWORK, containing information about the classes to which all METATHESAURUS concepts have been assigned.</Paragraph> <Paragraph position="3"> Note that when a = 0, the classification method resembles the nearest neighbour classification method, where the examples are used as a training set. On the other hand, when a = 1, the method is similar to naive Bayesian learning. However, in both cases the method represents a modification of the mentioned approaches, as the classes used in formula (1) are not all classes, but only those learned by the GA.</Paragraph> <Paragraph position="4"> The knowledge sources used in our term classification experiments include METATHESAURUS and SEMANTIC NETWORK. As the number of terms in METATHESAURUS was too large (2.10 million terms) and the classification scheme too broad (135 classes) for the preliminary experiments, we decided to focus only on terms belonging to a subtree of the global hierarchy of the SEMANTIC NETWORK. The root of this subtree refers to substances, and it contains 28 classes.</Paragraph> <Paragraph position="5"> The corpus used in conjunction with the above ontology consisted of 2082 abstracts on nuclear receptors retrieved from the MEDLINE database (MEDLINE, 2003). The majority of terms found in the corpus were related to nuclear receptors and other types of biological substances. The domain-specific verbs were extracted automatically from this corpus in the way described in Section 3.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.2 Evaluation Framework </SectionTitle> <Paragraph position="0"> When retrieving terms found in the context of domain-specific verbs (see Section 3 for details), both terms found in the ontology and terms recognised on the fly by the C/NC-value method should be extracted. However, for the purpose of evaluation, only terms classified in the ontology were used. In this way, it was possible to verify automatically whether such terms were correctly classified by comparing the classes suggested by the classification method with the original classification information found in the ontology.</Paragraph>
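To make the verification step concrete, the sketch below shows one way it could be implemented: a class suggested for a term counts as correct only if it appears among the classes the ontology already records for that term (UMLS allows a term to carry several classes, as noted in Section 5.3). This is an illustrative reconstruction, not the authors' code; the data structures and example entries are assumptions.

```python
# Illustrative sketch of the automatic verification used for evaluation:
# a suggested class is accepted as correct if it is one of the classes
# that the ontology (UMLS SEMANTIC NETWORK) records for the term.
# The gold_classes mapping below is hypothetical example data.

from typing import Dict, Set

def is_correctly_classified(
    term: str,
    suggested_class: str,
    gold_classes: Dict[str, Set[str]],   # term -> classes found in the ontology
) -> bool:
    return suggested_class in gold_classes.get(term, set())

# Hypothetical usage with made-up entries (not taken from UMLS):
gold_classes = {
    "retinoic acid": {"Organic Chemical", "Pharmacologic Substance"},
    "estrogen receptor": {"Amino Acid, Peptide, or Protein", "Receptor"},
}
print(is_correctly_classified("retinoic acid", "Organic Chemical", gold_classes))  # True
```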
<Paragraph position="1"> During the phase of retrieving the verb-term combinations, some of the terms were singled out for testing. Namely, for each verb, 10% of the retrieved terms were randomly selected for testing, and the union of all such terms formed a testing set (138 terms) for the classification task. The remaining terms constituted a training set (1618 terms) and were used for learning the complementation patterns.</Paragraph> </Section> </Section> <Section position="7" start_page="4" end_page="4" type="evalu"> <SectionTitle> 5.3 Results </SectionTitle> <Paragraph position="0"> Based on the training set, domain-specific verbs were associated with complementation patterns (see Table 1 for examples). Then, each term from the testing set was associated with the verb it most frequently co-occurred with. The complementation pattern learnt for that verb was used to classify the term in question.</Paragraph> <Paragraph position="1"> Since the UMLS ontology contains a number of prototypical examples for each class, we used these class representatives to compare unclassified terms to their potential classes, as indicated in Section 4. Table 2 shows the results for some of the terms from the testing set and compares them to the correct classifications obtained from the ontology. Note that in UMLS one term can be assigned to multiple classes. We regarded a testing term as correctly classified if the automatically suggested class was among these classes. Table 3 provides information on the performance of the classification method for each of the considered verbs separately and for the combined approach, in which the verb most frequently co-occurring with a given term was used for its classification. The combined approach provided considerably higher recall (around 50%) and a slight improvement in precision (around 64%) compared to the average values obtained with the same method for each of the verbs separately. The classification precision did not tend to vary considerably, and was not affected by the recall values. The recall could be improved by taking into account more domain-specific verbs, while the improvement of precision depends on proper tuning of: (1) the module for learning the verb complementation patterns, and (2) the similarity measure used for the classification. Another possibility is to generalise the classification method by relying on domain-specific lexico-syntactic patterns instead of verbs. Such patterns would have higher discriminative power than verbs alone. Moreover, they could be acquired automatically. For instance, the CP-value method can be used for their extraction from a corpus (Nenadic et al., 2003a).</Paragraph> <Paragraph position="2"> The values for precision and recall provided in Table 3 refer to the classification method itself. If it were used for automatic ontology updating, the success rate of such an update would also depend on the performance of the term recognition method, as the classification module would operate on its output. We used the C/NC-value method for ATR, although any other method could be used for this purpose. We chose the C/NC-value method because it is being continually improved and currently achieves around 72% recall and 98% precision (Nenadic et al., 2002).</Paragraph> </Section> </Paper>
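As a rough illustration of the combined approach and of how the figures in Table 3 could be computed, the sketch below picks, for each testing term, the verb it most frequently co-occurs with, classifies the term with the pattern learned for that verb, and then scores precision over the classified terms and recall over the whole testing set. All names here (classify_with_pattern, the co-occurrence counts) are assumed scaffolding around the method described above, not the original implementation.

```python
# Hedged sketch, not the authors' code: the 'combined' verb selection and
# the precision/recall bookkeeping described in Section 5.3.
from collections import Counter
from typing import Callable, Dict, Optional, Set, Tuple

def combined_classify(
    term: str,
    verb_counts: Dict[str, Counter],                              # term -> Counter of co-occurring verbs
    classify_with_pattern: Callable[[str, str], Optional[str]],   # (term, verb) -> suggested class or None
) -> Optional[str]:
    """Classify a term using the complementation pattern of the verb
    it most frequently co-occurs with (the 'combined' approach)."""
    verbs = verb_counts.get(term)
    if not verbs:
        return None
    most_frequent_verb, _ = verbs.most_common(1)[0]
    return classify_with_pattern(term, most_frequent_verb)

def precision_recall(
    suggestions: Dict[str, Optional[str]],   # term -> suggested class (None if not classified)
    gold_classes: Dict[str, Set[str]],       # term -> classes recorded in the ontology
) -> Tuple[float, float]:
    """Precision over classified terms, recall over all testing terms;
    a suggestion counts as correct if it is among the term's ontology classes."""
    classified = {t: c for t, c in suggestions.items() if c is not None}
    correct = sum(1 for t, c in classified.items() if c in gold_classes.get(t, set()))
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(gold_classes) if gold_classes else 0.0
    return precision, recall
```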