<?xml version="1.0" standalone="yes"?> <Paper uid="P84-1059"> <Title>Building a Large Knowledge Base for a Natural Language System</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> SRI International </SectionTitle> <Paragraph position="0"> and Center for the Study of Language and Information</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> A sophisticated natural language system requires a large knowledge base. A methodology is described for constructing one in a principled way. Facts are selected for the knowledge base by determining what facts are linguistically presupposed by a text in the domain of interest. The facts are sorted into clusters, and within each cluster they are organized according to their logical dependencies. Finally, the facts are encoded as predicate calculus axioms.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. The Problem </SectionTitle> <Paragraph position="0"> It is well known that the interpretation of natural language discourse can require arbitrarily detailed world knowledge and that a sophisticated natural language system must have a large knowledge base. But heretofore, the knowledge bases in natural language systems have either encoded only a few kinds of knowledge - e.g., sort hierarchies - or facts in only very narrow domains. The aim of this paper is to present a methodology for constructing an intermediate-size knowledge base for a natural language system, one that constitutes a manageable and principled midway point between these simple knowledge bases and the impossibly detailed knowledge bases that people seem to use.</Paragraph> <Paragraph position="1"> The work described in this paper has been carried out as part of a project to build a system for natural language access to a computerized medical textbook on hepatitis. The user asks a question in English, and rather than attempting to answer it, the system returns the passages in the text relevant to the question. The English query is translated into a logical form by a syntactic and semantic translation component [Grosz et al., 1982]. The textbook is represented by a &quot;text structure&quot;, consisting, among other things, of summaries of the contents of individual passages, expressed in a logical language. Inference procedures, making use of a knowledge base, seek to match the logical form of the query with some part of the text structure. In addition, they attempt to solve various pragmatic problems posed by the query, including the resolution of coreference, metonymy, and the implicit predicates in compound nominals. The inference procedures are discussed elsewhere [Walker and Hobbs, 1981]. In this paper a brief example will have to suffice.</Paragraph> <Paragraph position="2"> 1 I am indebted to Bob Amsler and Don Walker for discussions concerning this work. This research was supported by NIH Grant LM03611 from the National Library of Medicine, by Grant IST-8209346 from the National Science Foundation, and by a gift from the Systems Development Foundation.</Paragraph> <Paragraph position="3"> Suppose the user asks the question, &quot;Can a patient with mild hepatitis engage in strenuous exercise?&quot; The relevant passage in the textbook is labelled, &quot;Management of the Patient: Requirements for Bed Rest&quot;. The inference procedures must show that this heading is relevant to this question by drawing the appropriate inferences from the knowledge base. Thus the knowledge base must contain the facts that rest is an activity that consumes little energy, that exercise is an activity, and that if something is strenuous it consumes much energy, as well as axioms that relate the concepts &quot;can&quot; and &quot;require&quot; via the concept of possibility.</Paragraph>
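<Paragraph> Encoded as axioms, these facts might take the following form. This is a minimal sketch in predicate calculus notation; the predicate names (activity, consume, possible, and so on) are illustrative, not the system's actual vocabulary:
\[ \forall e\; [\textit{rest}(e) \supset \textit{activity}(e) \wedge \textit{consume}(e, \textit{little-energy})] \]
\[ \forall e\; [\textit{exercise}(e) \supset \textit{activity}(e)] \]
\[ \forall e\; [\textit{strenuous}(e) \supset \textit{consume}(e, \textit{much-energy})] \]
\[ \forall p, e\; [\textit{can}(p, e) \equiv \textit{possible}(\textit{do}(p, e))] \]
Axioms of this sort would let the inference procedures connect &quot;strenuous exercise&quot; in the query with &quot;bed rest&quot; in the heading through the opposition between consuming much energy and consuming little.</Paragraph>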
<Paragraph position="4"> One way to build the knowledge base would have been to analyze the queries in some target dialogs we collected to determine what facts they seem to require, and to put just these facts into our knowledge base. However, we are interested in discovering general principles of selection and structuring of such intermediate-sized knowledge bases, principles that would give us reason to believe our knowledge base would be useful for unanticipated queries.</Paragraph> <Paragraph position="5"> Thus we have developed a three-stage methodology: 1. Select the facts that should be in the knowledge base by determining what facts are linguistically presupposed by the medical textbook. This gives us a very good indication of what knowledge of the domain the user is expected to bring to the textbook and would bring to the system.</Paragraph> <Paragraph position="6"> 2. Organize the facts into clusters, and organize the facts within each cluster according to the logical dependencies among the concepts they involve.</Paragraph> <Paragraph position="7"> 3. Encode the facts as predicate calculus axioms, regularizing the concepts, or predicates, as necessary.</Paragraph> <Paragraph position="8"> These stages are discussed in the next three sections.</Paragraph> </Section> <Section position="4" start_page="0" end_page="284" type="metho"> <SectionTitle> 2. Selecting the Facts </SectionTitle> <Paragraph position="0"> To be useful, a natural language system must have a large vocabulary. Moreover, when one sets out to axiomatize a domain, unless one has a rich set of predicates and facts to be responsible for, a sense of coherence in the axiomatization is hard to achieve; one's efforts seem ad hoc. So the first step in building the knowledge base is to make up an extensive list of words, or predicates, or concepts (the three terms will be used interchangeably here), and an extensive list of relevant facts about these predicates. We chose about 350 words from our target dialogs and headings in the textbook and encoded the relevant facts involving these concepts. Because there are dozens of facts one could state involving any one of these predicates, we were faced with the problem of determining those facts that would be most pertinent for natural language understanding in this domain.</Paragraph> <Paragraph position="1"> Our principal tool at this stage was a full-sentence concordance of the textbook, displaying the contexts in which the words were used. Our method was to examine these contexts and to ask what facts about each concept were required to justify each of these uses - what did their uses linguistically presuppose.</Paragraph> <Paragraph position="2"> The three principal linguistic phenomena we looked at were predicate-argument relations, compound nominals, and conjoined phrases. As an example of the first, consider two uses of the word &quot;data&quot;. The phrase &quot;extensive data on histocompatibility antigens&quot; points to the fact about data that it is a set (justifying &quot;extensive&quot;) of particular facts about some subject (justifying the &quot;on&quot; argument). The phrase &quot;the data do not consistently show ...&quot; points to the fact that data is assembled to support some conclusion. To arrive at the facts, we ask questions like &quot;What is data that it can be extensive or that it can show something?&quot;</Paragraph>
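<Paragraph> Stated as axioms, the two presupposed facts about &quot;data&quot; might look as follows; this is a sketch, and the predicate vocabulary is invented for illustration:
\[ \forall d\; [\textit{data}(d) \supset \textit{set}(d) \wedge \forall f\; (\textit{member}(f, d) \supset \textit{fact}(f) \wedge \exists s\; \textit{about}(f, s))] \]
\[ \forall d\; [\textit{data}(d) \supset \exists c\; (\textit{conclusion}(c) \wedge \textit{support}(d, c))] \]
The first axiom justifies &quot;extensive&quot; and the &quot;on&quot; argument; the second justifies uses like &quot;the data ... show&quot;.</Paragraph>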
<Paragraph position="3"> For compound nominals we ask, &quot;What general facts about the two nouns underlie the implicit relation?&quot; So for &quot;casual contact circumstances&quot; we posit that contact is a concomitant of activities, and the phrase &quot;contact mode of transmission&quot; leads us to the fact that contact possibly leads to transmission of an agent. Conjoined noun phrases indicate the existence of a superordinate in a sort hierarchy covering all the conjoined concepts. Thus, the phrase &quot;epidemiology, clinical aspects, pathology, diagnosis, and management&quot; tells us to encode the facts that all of these are aspects of a disease.</Paragraph> <Paragraph position="4"> As an illustration of the method, let us examine various uses of the word &quot;disease&quot; to see what facts it suggests: * &quot;destructive liver disease&quot;: A disease has a harmful effect on one or more body parts. * &quot;hepatitis A virus plays a role in chronic liver disease&quot;: A disease may be caused by an agent. * &quot;the clinical manifestations of a disease&quot;: A disease is detectable by signs and symptoms. * &quot;the course of a disease&quot;: A disease goes through several stages in time. * &quot;infectious disease&quot;: A disease can be transmitted. * &quot;a notifiable disease&quot;: A disease has patterns in the population that can be traced by the medical community.</Paragraph> <Paragraph position="5"> We emphasize that this is not a mechanical procedure but a method of discovery that relies on our informed intuitions. Since it is largely background knowledge we are after, we cannot expect to get it directly by interviewing experts. Our method is a way of extracting it from the presuppositions behind linguistic use.</Paragraph> <Paragraph position="6"> The first thing our method gives us is a great deal of selectivity in the facts we encode. Consider the word &quot;animal&quot;. There are hundreds of facts that we know about animals. However, in this domain there are only two facts we need: animals are used in experiments, as seen in the compound nominal &quot;laboratory animal&quot;, and animals can have a disease, and thus transmit it, as seen in the phrase &quot;animals implicated in hepatitis&quot;. Similarly, the only relevant fact about &quot;water&quot; is that it may be a medium for the transmission of disease.</Paragraph> <Paragraph position="7"> Secondly, the method points us toward generalizations we might otherwise miss, when we see a number of uses that seem to fall within the same class. For example, the uses of the word &quot;laboratory&quot; seem to be of two kinds: 1. &quot;laboratory animals&quot;, &quot;laboratory spores&quot;, &quot;laboratory contamination&quot;, &quot;laboratory methods&quot;. 2. &quot;a study by a research laboratory&quot;, &quot;laboratory testing&quot;, &quot;laboratory abnormalities&quot;, &quot;laboratory characteristics of hepatitis A&quot;, &quot;laboratory picture&quot;. The first of these rests on the fact that experiments involving certain events and entities take place in laboratories. The second rests on the fact that information is acquired there.</Paragraph>
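<Paragraph> A sketch of these two underlying facts as axioms, again with invented predicate names, might be:
\[ \forall e\; [\textit{experiment}(e) \supset \exists l\; (\textit{laboratory}(l) \wedge \textit{take-place}(e, l))] \]
\[ \forall l\; [\textit{laboratory}(l) \supset \exists i\; (\textit{information}(i) \wedge \textit{acquire-in}(i, l))] \]
Each compound nominal in the first class is then licensed by the first axiom, and each in the second class by the second.</Paragraph>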
<Paragraph position="8"> A classical issue in lexical semantics that arises at this stage is the problem of polysemy. Should we consider a word, or predicate, as ambiguous, or should we try to find a very general characterization of its meaning that abstracts away from its use in various contexts? The concordance method suggests a solution. The rule of thumb we have followed is this: if the uses fall into two or three distinct, large classes, the word is treated as having separate senses, whereas if the uses seem to be spread all over the map, we try to find a general characterization that covers them all. The word &quot;derive&quot; is an example of the first case. A derivation is either of information from an investigative activity, as in &quot;epidemiologic patterns derived from historical studies&quot;, or of chemicals from body parts, as in &quot;enzymes derived from intestinal mucosa&quot;. By contrast, the word &quot;produce&quot; (and the word &quot;product&quot;) can be used in a variety of ways: a disease can produce a condition, a virus can produce a disease or a viral particle, something can produce a virus (&quot;the amount of virus produced in the carrier state&quot;), intestinal flora can produce compounds, and something can produce chemicals from blood (&quot;blood products&quot;). All of this suggests that we want to encode only the fact that if x produces y, then x causes y to come into existence.</Paragraph>
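<Paragraph> At the encoding stage this becomes a single axiom; a sketch:
\[ \forall x, y\; [\textit{produce}(x, y) \supset \textit{cause}(x, \textit{exist}(y))] \]
The variety of uses is then carried by the arguments x and y rather than by separate lexical senses.</Paragraph>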
<Paragraph position="9"> At this stage in our method, we aimed at only informal, English statements of the facts. We ended up with approximately 1000 facts for the knowledge base.</Paragraph> </Section> <Section position="5" start_page="284" end_page="285" type="metho"> <SectionTitle> 3. Organizing the Knowledge Base </SectionTitle> <Paragraph position="0"> The next step is to sort the facts into natural &quot;clusters&quot; (cf. [Hayes, 1984]). For example, the fact &quot;If x produces y, then x causes y to exist&quot; is a fact about causality. The fact &quot;The replication of a virus requires components of a cell of an organism&quot; is a fact about viruses. The fact &quot;A household is an environment with a high rate of intimate contact, thus a high risk of transmission&quot; is in the cluster of facts about people and their activities. The fact &quot;If bilirubin is not secreted by the liver, it may indicate injury to the liver tissues&quot; is in the medical practice cluster.</Paragraph> <Paragraph position="1"> It is useful to distinguish between clusters of &quot;core knowledge&quot; that is common to most domains and &quot;domain-specific knowledge&quot;. Among the clusters of core knowledge are space, time, belief, and goal-directed behavior. The domain-specific knowledge includes clusters of facts about viruses, immunology, physiology, disease, and medical practice. The cluster of facts about people and their activities lies somewhere in between these two.</Paragraph> <Paragraph position="2"> We are taking a rather novel approach to the axiomatization of core knowledge. Much of our knowledge and language seems to be based on an underlying &quot;topology&quot;, which is then instantiated in many other areas, like space, time, belief, social organizations, and so on. We have begun by axiomatizing this fundamental topology. At its base is set theory, axiomatized along traditional lines. Next is a theory of granularity, in which the key concept is &quot;x is indistinguishable from y with respect to grain g&quot;. A theory of scalar concepts combines granularity and partial orders. The concept of change of state and the interactions of containment and causality are given (perhaps overly simple) axiomatizations. Finally there is a cluster centered around the notion of a &quot;system&quot;, which is defined as a set of entities and a set of relations among them. In the &quot;system&quot; cluster we provide an interrelated set of predicates enabling one to characterize the &quot;structure&quot; of a system, producer-consumer relations among the components, the &quot;function&quot; of a component of a system as a relation between the component's behavior and the behavior of the system as a whole, notions of normality, and distributions of properties among the elements of a system. The applicability of the notion of &quot;system&quot; is very wide; among the entities that can be viewed as systems are viruses, organs, activities, populations, and scientific disciplines.</Paragraph> <Paragraph position="3"> Other general commonsense knowledge is built on top of this naive topology. The domain of time is seen as a particular kind of scale defined by change of state, and the axiomatization builds toward such predicates as &quot;regular&quot; and &quot;persist&quot;. The domain of belief has three principal subclusters in this application: learning, which includes such predicates as &quot;find&quot;, &quot;test&quot; and &quot;manifest&quot;; reasoning, explicating predicates such as &quot;leads-to&quot; and &quot;consistent&quot;; and classifying, with such predicates as &quot;distinguish&quot;, &quot;differentiate&quot; and &quot;identify&quot;. The domain of modalities explicates such concepts as necessity, possibility, and likelihood. Finally, in the domain of goal-directed behavior, we characterize such predicates as &quot;help&quot;, &quot;care&quot; and &quot;risk&quot;.</Paragraph> <Paragraph position="4"> In the lowest-level domain-specific clusters - viruses, immunology, physiology, and people and their activities - we begin by specifying their ontology (the different sorts of entities and classes of entities in the cluster), the inclusion relations among the classes, and the behaviors of entities in the clusters and their interactions with other entities.</Paragraph> <Paragraph position="5"> The &quot;Disease&quot; cluster is axiomatized primarily in terms of a temporal schema of the progress of an infection. The cluster of &quot;Medical Practice&quot;, or medical intervention in the natural course of the disease, can be axiomatized as a plan, in the AI sense, for maintaining or achieving a state of health in the patient, where different branches of the plan correspond to where in the temporal schema for disease the physician intervenes and to the mode of intervention.</Paragraph> <Paragraph position="6"> Most of the content of the domain-specific clusters is specific to medicine, but the general principles along which it was constructed are relevant to many applications. Frequently the best way to proceed is first to identify the entities and classification schemes in several clusters, state the relationships among the entities, and encode axioms articulating clusters with higher- and lower-level clusters. Often one then wants to specify temporal schemas involving interactions of entities from several domains and goal-directed intervention in the natural course of these schemas.</Paragraph> <Paragraph position="7"> The concordance method of the second stage is quite useful in ferreting out the relevant facts, but it leaves some lacunae, or gaps, that become apparent when we look at the knowledge base as a whole. The gaps are especially frequent in commonsense knowledge. The general principle we follow in encoding this lowest level of the knowledge base is to aim for a vocabulary of predicates that is minimally adequate for expressing the higher-level, medical facts and to encode the obvious connections among them. One heuristic has proved useful: if the axioms in higher-level domains are especially complicated to express, this indicates that some underlying domain has not been sufficiently explicated and axiomatized. For example, this consideration has led to a fuller elaboration of the &quot;systems&quot; domain. Another example concerns the predicates &quot;parenteral&quot;, &quot;needle&quot; and &quot;bite&quot;, appearing in the domain of &quot;disease transmission&quot;. Initial attempts to axiomatize them indicated the need for axioms, in the &quot;naive topology&quot; domain, about membranes and the penetration of membranes allowing substances to move from one side of the membrane to the other.</Paragraph> <Paragraph position="8"> Within each cluster, concepts and facts seem to fall into small groups that need to be defined together. For example, the predicates &quot;clean&quot; and &quot;contaminate&quot; need to be defined in tandem. There is a larger example in the &quot;Disease Transmission&quot; cluster. The predicate &quot;transmit&quot; is fundamental, and once it has been characterized as the motion of an infectious agent from a person or animal to a person via some medium, the predicates &quot;source&quot;, &quot;route&quot;, &quot;mechanism&quot;, &quot;mode&quot;, &quot;vehicle&quot; and &quot;expose&quot; can be defined in terms of its schema.</Paragraph>
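<Paragraph> For instance, if we take transmit(a, x, y, m) to mean that infectious agent a moves from person or animal x to person y via medium m (a sketch; the argument structure here is illustrative, not the system's actual encoding), the dependent predicates can be characterized by one-line axioms such as:
\[ \forall a, x, y, m\; [\textit{transmit}(a, x, y, m) \supset \textit{source}(x, a)] \]
\[ \forall a, x, y, m\; [\textit{transmit}(a, x, y, m) \supset \textit{vehicle}(m, a)] \]
\[ \forall a, x, y, m\; [\textit{transmit}(a, x, y, m) \supset \textit{expose}(y, a)] \]
</Paragraph>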
<Paragraph position="9"> In addition, relevant facts about body fluids, food, water, contamination, needles, bites, propagation, and epidemiology rest on an understanding of &quot;transmit&quot;. In each domain there tends to be a core of central predicates whose nature must be explicated with some care. A large number of other predicates can then be characterized fairly easily in terms of these.</Paragraph> </Section> </Paper>