<?xml version="1.0" standalone="yes"?> <Paper uid="E87-1039"> <Title>Acquisition of Conceptual Data Models from Natural Language Descriptions</Title> <Section position="3" start_page="241" end_page="244" type="metho"> <SectionTitle> Conceptual Data Models </SectionTitle> <Paragraph position="0"> Within system development methods, there exist several notations for representing aspects of information processing systems. A comprehensive method will be able to represent a conceptual model of the real and abstract objects in the system's environment, the functional requirements of the system, and its detailed design and implementation. Methods differ according to their emphasis on the data the system deals with or its processing requirements. Here, we will only discuss data-oriented approaches, and in particular Entity-Relationship (ER) modelling (Martin 1984, MacDonald 1986), and the Information Structure Diagram within NIAM (Verheijen and van Bekkum, 1982; Blank and Krijger, 1982), an approach to conceptual modelling that claims to be directly informed by considerations of the structure of natural language sentences.</Paragraph> <Paragraph position="1"> Both approaches started as paper-and-pencil notations for unambiguously recording the systems analyst's understanding of the relationships between objects that are significant to the proposed computer system. Now, however, both feature in interactive computer-based tools. ER analysis is the cornerstone of several application development toolkits, including the Information Engineering Workbench, which uses such models as input to a process that ultimately leads to automatic code generation. 
Using this software, the analyst creates a model of data objects and relationships, and then specifies how programs will access this data by superimposing access paths and conditions on the ER diagram to produce a 'data navigation diagram', which can in turn be the input to an automatic process of computer program generation.</Paragraph> <Paragraph position="2"> The NIAM approach is somewhat different. It aims to avoid altogether the need to describe the objectives of a computer program procedurally. The NIAM conception is that if all the relationships and dependencies among data objects are specified rigorously, all events in the world can be recorded in the database by the assertion or retraction of 'sentences' from the information base (database). Further, rules for deriving information, such as that produced in reports, can be included in the conceptual model. Under this approach, the need for application programs is obviated, since 'all computations may be executed under the control of the information base handler instead of an application program' (Blank and Krijger 1982, p 140). NIAM thus belongs to the 'executable specifications' class of approaches to system design. To illustrate NIAM's modelling language, we will use example sentences (1) and (2) from a narrative in a database examination paper.</Paragraph> <Paragraph position="3"> (1) Customers send the company purchase orders for pharmaceutical supplies.</Paragraph> <Paragraph position="4"> (2) Each order contains requests for quantities of many different products which are all required for one shop.</Paragraph> <Section position="1" start_page="241" end_page="243" type="sub_section"> <SectionTitle> 2.1 NIAM Information Structure Diagrams </SectionTitle> <Paragraph position="0"> In this section, we describe the modelling constructs of the NIAM information structure diagram (ISD). 
The derivation of the model from natural language text is taken up in section 6.</Paragraph> <Paragraph position="1"> The NIAM method uses similar constructs to ERA modelling (employing different terminology), but at a more atomistic level. Its perspective is derived from the structures of natural language, on the justification that a database comprises a set of sentences and the purpose of the conceptual data model is to specify a grammar of the sentences allowed in a particular database (Blank and Krijger, 1982). Relationships between objects have associated with them two role names, one for each related object.</Paragraph> <Paragraph position="2"> Figure 1 shows how (1) and (2), together with some further information about warehouses and picking lists, can be represented in NIAM notation.</Paragraph> <Paragraph position="3"> Figure 1 - NIAM Information Structure Diagram. NIAM representations are constructed from objects and relationships.</Paragraph> <Paragraph position="4"> Objects are of two kinds: NOLOTs or NOn-Lexical Object Types are concrete or abstract objects of reality, and LOTs or Lexical Object Types are objects of which occurrences have values, i.e. they are names. NOLOTs and LOTs correspond to entities and attributes, and are shown by unbroken and broken circles respectively.</Paragraph> <Paragraph position="5"> Relationships are associations between objects: either between two NOLOTs or between a NOLOT and a LOT. They are shown by lines connecting two circles. Semantically, the relationship is the Cartesian product of the two related object types. On the line are two rectangular boxes bearing the name of the role each object plays in the relationship. 
The concept of a role is similar to that of a case role in linguistics (cf. Fillmore 1968).</Paragraph> <Paragraph position="6"> In addition to portraying objects and their relationships, NIAM also explicitly represents constraints between these objects.</Paragraph> <Paragraph position="7"> Relationship degree is shown by double-headed arrows beside one or both roles. The relationship between customer and purchase order in Figure 1 is a one to many (1:N) relationship in that a customer may send many purchase orders, but each order is only sent by one customer. A many to many (M:N) relationship, such as that between purchase orders and pharmaceutical supplies, is shown by an arrow spanning both roles. A one to one (1:1) relationship has separate arrows alongside each role. Whether a relationship is obligatory or not is also shown in the diagram. The 'V' across the line between purchase order and the 'sent-by' role indicates that a purchase order cannot exist without being related in this way to a customer, but the absence of such a symbol at the opposite end of the relationship line shows that a customer can exist without having any (current) purchase orders.</Paragraph> <Paragraph position="8"> Additional constraints, such as subset and set inequality constraints between objects, relationships or roles, can also be modelled on the NIAM ISD. For example, the arrow linking pharmaceutical supplies to product indicates a subset relationship.</Paragraph> <Paragraph position="9"> Often, M:N relationships indicate that further analysis is required. Where such a relationship conveys genuine information, it is usually helpful to resolve the relationship into two 1:N relationships, with a new entity type between.</Paragraph> <Paragraph position="10"> The M:N relationship in Figure 1 between purchase order and pharmaceutical supplies was derived from sentence (1). In sentence (2), further information about orders was supplied. 
All the information conveyed by the M:N relationship is represented by the chain of 1:N relationships linking purchase order, request, product and pharmaceutical supplies.</Paragraph> <Paragraph position="11"> Tools for Conceptual Modelling. Many proprietary tools exist for editing conceptual data models, e.g. Excellerator, Information Engineering Workbench, and Blues. Such a tool enables the user to draw diagrams using a mouse input device. The user selects from the symbols in the notation by clicking the mouse button, moving the cursor to a desired location and clicking again. Lines connecting symbols can be selected in the same way and placed by clicking twice, to indicate the two symbols the line connects.</Paragraph> <Paragraph position="12"> Violations of the 'syntax' of the notation are policed by the software.</Paragraph> <Paragraph position="13"> Modifications to both the content and layout of a diagram can be made by cutting and pasting. Annotating components with their names and other attributes is done by clicking on existing symbols to open a dialogue window.</Paragraph> <Paragraph position="14"> As the diagram is thus created and edited, the information expressed in it is stored in a data dictionary (or 'encyclopaedia').</Paragraph> <Paragraph position="15"> It can be argued that such an interface is so user-friendly that no case could be made for a natural language alternative. However, it is emphasised that a tool as described above is entirely passive. It simply records the information fed into it, and can give no guidance as to the correct way to represent a given state of affairs. It can only be used by an expert in the method of analysis it documents. For such an expert, it is probably an optimal tool. However, we have noted in section 1 that owing to the babel of alternative notations, there are circumstances in which experienced analysts are required to use methods they are not familiar with. 
This is the premise of the AMADEUS project (Loucopoulos et al 1986, Black et al 1987), which seeks to provide a facility for translating between alternative method notations. Briefly, the requirement to use unfamiliar methods can arise because of job mobility, organizational take-overs, customers dictating the method to be used by those who tender for their contracts, and in the course of training.</Paragraph> <Paragraph position="16"> It is also envisaged that the system will be used by end users to develop applications without professional support. Figure 1 illustrates, by the variety of special symbols used and their connectivity, that for end-user application development, notations like the NIAM ISD would require an explanation facility to support comprehension. For a non-expert to use such a notation constructively to develop a specification also requires some form of expert assistance. A final motivation for building the system is that, as an integrated natural language and graphics interface, it provides a context in which the relative merits of the two interface styles can be compared. As Thompson (1983) has noted, almost no empirical work has ever been carried out into the relative merit of natural language and graphic interfaces.</Paragraph> </Section> <Section position="2" start_page="243" end_page="243" type="sub_section"> <SectionTitle> 4.1 Dialogue Structure </SectionTitle> <Paragraph position="0"> A natural interface using both text and graphics requires a large bit-mapped screen and both keyboard and pointing input devices. An Apollo DN3000 running Quintus Prolog under UNIX has been selected as an environment for development of the system. 
The intended dialogue structure employs two windows, one for text and one for graphics.</Paragraph> <Paragraph position="1"> In both cases, highlighting is used for attention focussing and for establishing correspondence between a diagram and natural language narrative.</Paragraph> <Paragraph position="2"> Text to graphics. Appendix A shows a hypothetical dialogue where the user input is in the text window.</Paragraph> <Paragraph position="3"> This dialogue owes much to the style of dialogue employed in Nanoklaus (Haas and Hendrix 1983), and would suit a very inexperienced or casual user. Someone more used to expressing rules in unambiguous English might be able to say most of the above in one sentence: &quot;A paper is written by one or more authors, one of which must also be its presenter, and any of whom may be the authors of other papers.&quot; For this reason, the interface must have good syntactic coverage and a formal semantic component that deals with quantifier scoping.</Paragraph> <Paragraph position="4"> Graphics to text. A dialogue where the input takes place in the graphics window proceeds as follows: The user selects and places new symbols in the graphics window.</Paragraph> <Paragraph position="5"> For each symbol added, the change is recorded in the session database, and its internal representation is passed to the language generation component, which produces an English description of the effect of the changes. Suppose, for example, that the graphics window contains the first drawing shown in Appendix A. The user then adds the V symbol to produce the next drawing shown. In response, the following text is produced: &quot;A paper must be written_by at least one author.</Paragraph> <Paragraph position="6"> (Previously it could apparently exist without being written by an author.)&quot;.</Paragraph> <Paragraph position="7"> Alternative uses of the natural language generation facility exist. 
For example, a user could highlight a part of the diagram and request a translation into English, or could enter changes in a &quot;what if&quot; mode and have their consequences explained.</Paragraph> </Section> <Section position="3" start_page="243" end_page="244" type="sub_section"> <SectionTitle> 4.2 System structure </SectionTitle> <Paragraph position="0"> To produce a dialogue such as that shown in Appendix A or as described above, a system organization such as that shown in Figure 2 is required.</Paragraph> <Paragraph position="1"> Both user interfaces must use the same internal representation for the aspects of systems described alternatively in text or graphics. This is discussed below. The session/specification database is the counterpart of the data dictionary in individual proprietary tools. In such tools, the graphics interface is so integral a part of the system that it has to be re-implemented along with the natural language interface.</Paragraph> </Section> </Section> <Section position="4" start_page="244" end_page="246" type="metho"> <SectionTitle> 5 Knowledge representation framework </SectionTitle> <Paragraph position="0"> It has been established in the separate AMADEUS project (Black et al 1987) that a frame representation based on FRL (Roberts and Goldstein, 1977) is capable of representing all the modelling constructs used in a range of requirements specification notations. Specifically, in the case of NIAM, objects (lexical and non-lexical) and relationships are represented by frames, and roles by slots. 
Constraints of relationship degree and optionality are represented together by facets of role slots.</Paragraph> <Paragraph position="1"> As an example, Figure 3 shows a set of frames representing some of the information about paper authorship shown in Appendix A.</Paragraph> <Paragraph position="2"> It is intended that a uniform knowledge representation structure such as that shown in Figure 3 will be used throughout the system, both for storing the facts gathered in a session, and for representing the stored knowledge in the system, including the dictionary.</Paragraph> <Paragraph position="4"> Haas and Hendrix (1983) describe a system where a semantic network model of object classes, instances and properties is constructed through a co-operative natural language dialogue. In the early version, Nanoklaus, the syntactic coverage is restricted to simple sentences in which the user may assert propositions about the set membership and other properties of objects.</Paragraph> <Paragraph position="5"> Enomoto et al (1984) describe a system in which an unambiguous fragment of English (based on Montague's PTQ) can be used in a highly constrained way to describe the desired behaviour of a system.</Paragraph> <Paragraph position="6"> Other work on natural language understanding of descriptive text has tended to use ad-hoc semantic grammars specialized to the application domain. Norton (1982) describes a program that acquires knowledge of the BASIC programming language's syntax and semantics from a textbook and uses this to generate an interpreter for part of the language. In some respects, the goals are similar to our own, but the semantic grammar approach used means that little of that approach is re-usable.</Paragraph> <Paragraph position="7"> Less directly related to the system specification domain is Mellish (1985), which describes a system for the semantic interpretation of mechanics problems expressed in English. 
The program made use of the given/new distinction in establishing the co-reference of definite and indefinite descriptions, incrementally constructing extensional semantic interpretations using intermediate intensional reference entities.</Paragraph> <Paragraph position="8"> Earlier work on text comprehension (e.g. de Jong, 1979) concentrated on skimming techniques to match text content against sketchy scripts. Such a grain of analysis is inappropriate for present purposes.</Paragraph> <Section position="1" start_page="244" end_page="246" type="sub_section"> <SectionTitle> 6.1 Conceptual Modelling from NL Text. </SectionTitle> <Paragraph position="0"> The goal of conceptual modelling is to identify the significant objects and relationships in the application universe of discourse. As with other NLU tasks, this requires knowledge of three sorts: syntax, semantics and real-world knowledge. In this section, we discuss the separate contribution each source of knowledge makes in conceptual modelling.</Paragraph> <Paragraph position="1"> Syntax. Martin (1984) has observed that there is a simple mapping of surface syntactic categories onto the components of ER modelling. Nouns correspond to entities (objects), and verbs correspond to relationships (or, in the case of NIAM, to role names). On this basis, sentences (3) and (4) would receive different analyses, as shown below.</Paragraph> <Paragraph position="2"> (3) Customers send orders for products.</Paragraph> <Paragraph position="3"> (4) Customers order products.</Paragraph> <Paragraph position="4"> The English description in (2) is much less directly helpful in identifying relationships. The attachment of the relative clause 'which are all required for one shop' to 'order' rather than 'product', 'request' or 'quantity' cannot be decided on purely syntactic grounds. Further, that 'quantity' is an attribute of 'request' rather than an entity in its own right cannot be determined without extra-linguistic knowledge. 
The requirement for a linguistic approach is that either it is constructed in the same manner as Nanoklaus, employing simple input phrase structures embedded in a cooperative dialogue, or else it has sufficient linguistic coverage to handle the complex sentence structures exhibited in (1) and (2). Most importantly in the latter respect, it should have a reasonable treatment of the variety of natural language quantifiers and relative clauses. Many database interfaces have such capabilities: McCord (1982), Dahl (1982) and Warren and Pereira (1982), inter alia.</Paragraph> <Paragraph position="5"> Semantics. Charniak (1983) makes a distinction between inferential and non-inferential semantics. The former is concerned with establishing the logical form corresponding to a syntactic analysis of a sentence, whereas the latter is concerned with co-occurrence restrictions between phrases which may be stated in terms of lexical subcategories such as human, mass, machine, etc.</Paragraph> <Paragraph position="6"> Database interfaces are the most common instances of complete natural language interfaces which comprise both syntactic and semantic components. As such they are potential models for the development of interfaces to new types of software systems. However, their approach to semantics cannot be imported wholesale. They avoid the general theoretical problem of what a semantics of natural language should consist of by an operational approach in which the propositional content of a sentence is represented by a database tuple, and lexical subcategorization is implemented in application-specific categories. The following dictionary entries for 'order', both as a noun and as a verb, have been encoded in the notation used by McCord (1982).</Paragraph> <Paragraph position="7"> [obj:Prod:goods,npobj(from):Supp:prsn]).</Paragraph> <Paragraph position="8"> Each of these dictionary entries has five components. 
The first is the name of the word, the second is the propositional meaning, the third a variable denoting time, the fourth specifies the semantic subcategorization of the word (in the case of nouns) or its subject (in the case of verbs), and the last subcategorizes the objects or other postmodifiers the word may take.</Paragraph> <Paragraph position="9"> One danger with application-specific lexical subcategorization is that it may be applied too restrictively. For example, in the lexicon published in (McCord 1982), subclasses are specifically restricted to the database entities that can be expected in a query. Thus the semantics of 'take' are specified to expect a student as subject and a course as object. Such restrictions are fine for database queries such as (5), but a question such as (6) cannot even be asked.</Paragraph> <Paragraph position="10"> (6) Do lecturers ever take courses? Real world knowledge. It is not possible to produce an analysis such as that shown in Figure 1 without 'real-world' knowledge in addition to a grammar and dictionary. For example, the knowledge that pharmaceutical supplies are a subset of products is required to link the information acquired from the analyses of (1) and (2). The full extent to which real-world knowledge will be required in the system is not known, but it is assumed that the sort of notation shown in Figure 3 can be employed to encode arbitrary real-world knowledge for the system.</Paragraph> <Paragraph position="11"> The boundary between what is linguistic knowledge and what is real-world knowledge is not a clear one. In the sample dictionary entries for 'order', we have shown that corresponding to an order, there is also an item. This was necessary so that the type of object can be linked to an argument place in the predication. It can be argued that this amounts to non-linguistic knowledge that orders typically comprise several distinct items.</Paragraph> <Paragraph position="12"> Adapting a database interface. 
An initial prototype system for inferring the existence of entities and relationships from natural language descriptions is being constructed using McCord's Slot Grammar (McCord 1982), selected for its syntactic coverage and treatment of a variety of natural language quantifiers.</Paragraph> <Paragraph position="13"> To adapt the form of lexical entries in the McCord parser from the database query task to the present one, generic definitions of word meanings have been provided, allowing a wider range of assertions to be made.</Paragraph> <Paragraph position="14"> Results. With these definitions it has been possible, in a rudimentary way, to determine the existence of some relationship types between entities and so to build simple ER models. This is done by examining the attributes of the relational database predicates in the parse tree. The existence of a relationship between two database relations is indicated by the sharing of attributes. If the identifier of one relation occurs as a non-identifying attribute in another relation, we may infer a 1:N relationship between them. For example, in the following parse of the sentence &quot;customers order products&quot; the variable _133 is common to both order and customer:</Paragraph> <Paragraph position="15"> (This occurs precisely because the dictionary entry for &quot;order&quot; explicitly provided for the identifier of the subject to be an argument of the predication.) The sharing of the arguments tells us that a relationship exists between the entity order and the entity customer, and furthermore, that it is a 1:N relationship from customer to order, since the shared argument is the whole key of customer, and either a non-key or part key in order.</Paragraph> <Paragraph position="16"> Current status of project. The prototyping activity described above is ongoing, but in parallel, the overall design is being elaborated, and a purpose-built parser based on LFG is being implemented in Prolog. 
Work on the generation component has not yet commenced.</Paragraph> </Section> </Section> </Paper>