<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1806"> <Title>Automatically Inducing Ontologies from Corpora</Title> <Section position="4" start_page="37" end_page="37" type="intro"> <SectionTitle> . It </SectionTitle> <Paragraph position="0"> is therefore infeasible to construct PRONTO by hand from scratch. PRONTO is also much larger than other ontologies in the biology area; for example, the Gene Ontology is rather high-level, and contains (as of March 2004) only about 17,000 terms.</Paragraph> <Section position="1" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 2.1 System Architecture </SectionTitle> <Paragraph position="0"> An overall architecture for domain-independent ontology induction is shown in Figure 1. The documents are preprocessed to separate out headers. Next, terms are extracted using finite-state syntactic parsing and scored to discover domain-relevant terms. The subsequent processing infers semantic relations between pairs of terms using the 'weak' knowledge sources run in the order described below. Evidence from multiple knowledge sources is then combined to infer the resulting relations. The resulting ontologies are written out in a standard XML-based format (e.g., XOL, RDF, OWL), for use in various information access applications.</Paragraph> <Paragraph position="1"> While the ontology induction procedure does not involve human labor, except for writing the preprocessing and term tokenization program for specialized technical domains, the human may edit the resulting ontology for use in a given application. An ontology editor has been developed, discussed briefly in Section 3.1.</Paragraph> </Section> <Section position="2" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 2.2 Term Discovery </SectionTitle> <Paragraph position="0"> The system takes a collection of documents in a subject area, and identifies terms characteristic of the domain. In a given domain such as CompuTerm 2004 - 3rd International Workshop on Computational Terminology48 bioinformatics, specialized term tokenization (into single- and multi-word terms) is required. The protein names can be long, e.g., &quot;steroid/thyroid/retinoic nuclear hormone receptor homolog nhr-35&quot;, and involve specialized patterns. In constructing PRONTO, we have used a protein name tagger based on an ensemble of statistical classifiers to tag protein names in collections of MEDLINE abstracts (Anon 2004). Thus, in such a domain, a specialized tagger replaces the components in the dotted box in Figure 1.</Paragraph> <Paragraph position="1"> In other domains, we adopt a generic termdiscovery approach. Here the text is tagged for part-of-speech, and single- and multi-word terms consisting of minimal NPs are extracted using finite-state parsing with CASS (Abney 1996). All punctuation except for hyphens are removed from the terms, which are then lower-cased. Each word in each term is stemmed, with statistics (see below) being gathered for each stemmed term. 
<Paragraph position="2"> The terms are scored for domain-relevance, on the assumption that a term occurring significantly more often in a domain corpus than in a more diffuse background corpus is domain-relevant.</Paragraph> <Paragraph position="3"> As an illustration, Table 1 compares the number of documents containing the term 'income tax' (or 'income taxes') in a long (2.18 Mb) IRS publication, Publication 17, from an IRS web site (IRS 2001), against a larger (27.63 Mb) subset of the Reuters-21578 news corpus; in Publication 17, each &quot;chapter&quot; is treated as a document. One would expect 'income tax' to be far more characteristic of the IRS publication, and this is borne out by the document frequencies in the table. We use the log likelihood ratio (LLR) (Dunning 1993). LLR measures the extent to which a hypothesized model of the distribution of cell counts, H_a, differs from the null hypothesis, H_o (namely, that the percentage of documents containing this term is the same in both corpora). We used a binomial model for H_a.</Paragraph>
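<Paragraph> The scoring can be made concrete as follows (a minimal sketch under the binomial assumption above, with hypothetical counts; not the authors' implementation): each corpus is modelled as n documents of which k contain the term, H_o ties both corpora to a single shared rate, and H_a gives each corpus its own rate.

from math import log

def log_l(k, n, p):
    # Binomial log-likelihood of k successes in n trials at rate p
    # (guarding the degenerate rates 0 and 1).
    ll = 0.0
    if k:
        ll += k * log(p)
    if n - k:
        ll += (n - k) * log(1.0 - p)
    return ll

def llr(k1, n1, k2, n2):
    # k1/n1: documents containing the term / total documents, domain corpus;
    # k2/n2: the same counts in the background corpus.
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # single shared rate under H_o
    return 2.0 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                  - log_l(k1, n1, p) - log_l(k2, n2, p))

# Hypothetical counts: a term in most domain documents but few
# background documents receives a high LLR score.
print(llr(30, 40, 50, 10000))
</Paragraph>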
</Section> <Section position="3" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 2.3 Relationship Discovery </SectionTitle> <Paragraph position="0"> The main innovation in our approach is to fuse together information from multiple knowledge sources as evidence for particular semantic relationships between terms. To infer semantic relations such as kind-of and part-of, the system uses a bottom-up, data-driven approach that combines evidence from shallow methods. The first of these, subphrase analysis, is based on the presence of common syntactic heads, and allows us to infer, for example, that 'p68 protein' is a kind-of 'protein'. Likewise, in the TREC domain, subphrase analysis tells us that 'electric car' is a kind of 'car', and in the IRS domain, that 'federal income tax' is a kind of 'income tax'.</Paragraph> <Paragraph position="2"> Other relations are obtained from a thesaurus. For example, the Gene Ontology can be used to infer that 'ATP-dependent RNA helicase' is a kind of 'RNA-helicase'. Likewise, in the TREC domain, using WordNet tells us that 'tailpipe' is part of 'automobile', and in the IRS domain, that 'spouse' is a kind of 'person'. Synonyms are also merged together at this stage.</Paragraph> <Paragraph position="3"> We also infer hierarchical relations between terms by top-down clustering, using a context-based subsumption (CBS) algorithm. The algorithm uses a probabilistic measure of set covering to find subsumption relations. For each term in the corpus, we note the set of contexts in which the term appears. Term1 is said to subsume term2 when the conditional probability of term1 appearing in a context given the presence of term2, i.e., P(term1|term2), is greater than some threshold. CBS is based on the algorithm of Lawrie et al. (2001), which used a greedy approximation of the Dominating Set Problem for graphs to discover subsumption relations among terms. Unlike their work, we do not seek to minimize the set of covering terms; a subsumed term may therefore have multiple parents. The conditional probability threshold we use to determine subsumption (0.8) is much higher than in their approach, and we also restrict the height of the hierarchies we build to three tiers. Tightening these latter two constraints appears to notably improve the quality of our subsumption relations.</Paragraph> <Paragraph position="4"> The largest corpus against which CBS has been run is the ProMed corpus where, treating each paragraph as a distinct context, there were 117,690 contexts in the 11,198 documents. Here is an example from ProMed of a transitive relation that spans three tiers: 'mosquito' is a hypernym of 'mosquito pool', and 'mosquito pool' is in turn a hypernym of 'standing water'.</Paragraph> <Paragraph position="5"> A further knowledge source infers specific relations between terms based on characteristic cue-phrases that relate them. For example, the cue-phrase &quot;such as&quot; (Hearst 1992; Caraballo 1999) suggests a kind-of relation: 'a ligand such as triethylphosphine' tells us that 'triethylphosphine' is a kind of 'ligand'. Likewise, in the TREC domain, 'air toxics such as benzene' suggests that 'benzene' is a kind of 'air toxic'. However, since such cue-phrase patterns tend to be sparse, we do not use them in the evaluations described below.</Paragraph> <Paragraph position="6"> Although our approach is domain-independent, it is possible to factor in knowledge sources specific to a given domain. For example, in biology, the suffix '-ase' usually indicates an enzyme.</Paragraph> <Paragraph position="7"> Postmodifying PPs (found using a CASS grammar) can also be useful in some domains, as shown by 'tax on investment income of child' in Figure 2. So far, however, we have not investigated other domain-specific knowledge sources.</Paragraph> </Section> <Section position="4" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 2.4 Evidence Combination </SectionTitle> <Paragraph position="0"> The main point about these and other knowledge sources is that each may provide only partial information. Combining these knowledge sources, we expect, will lead to performance superior to that of any one of them alone.</Paragraph> <Paragraph position="1"> Not only do inferences from different knowledge sources support each other, but they are also combined to produce new inferences by transitivity. For example, since phrase analysis tells us that 'pyridine metabolism' is a kind-of 'metabolism', and the Gene Ontology tells us that 'metabolism' is a kind-of 'biological process', it follows that 'pyridine metabolism' is a kind-of 'biological process'. The evidence combination, in addition to computing the transitive closure of these relations, also detects inconsistencies, querying the user to resolve them when detected.</Paragraph>
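<Paragraph> As a minimal sketch of how such combination might work (our own illustration, assuming kind-of edges are pooled as simple (child, parent) pairs; not the authors' implementation), the pooled edges can be closed under transitivity, with cycles surfacing as inconsistencies to refer to the user.

def transitive_closure(edges):
    # edges: set of (child, parent) kind-of pairs pooled from all
    # knowledge sources (phrase analysis, thesaurus, CBS, ...).
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))  # kind-of is transitive
                    changed = True
    return closure

def inconsistencies(closure):
    # A term that is (transitively) a kind of itself signals a cycle,
    # the sort of conflict referred to the user.
    return {a for a, b in closure if a == b}

# Phrase analysis and the Gene Ontology each contribute one edge;
# the closure derives the new fact in the running example.
edges = {("pyridine metabolism", "metabolism"),
         ("metabolism", "biological process")}
closed = transitive_closure(edges)
print(("pyridine metabolism", "biological process") in closed)  # True
print(inconsistencies(closed))  # set(): no cycles here
</Paragraph> </Section> </Section> </Paper>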