<?xml version="1.0" standalone="yes"?> <Paper uid="A83-1020"> <Title>AUTOMATIC ANALYSIS OF DESCRIPTIVE TEXTS</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> II BACKGROUND </SectionTitle> <Paragraph position="0"> Many research workers are interested in different aspects of text analysis. Much of this work relies on sophisticated grammars to map the text to an internal representation. The work done by Schank (197~) and that of Sager (1981) are two contrasting examples of this interest. In addition to the research-oriented work, some commercial groups are interested in the practicability of generating database input from text.</Paragraph> <Paragraph position="1"> its properties in a single piece of text. The basic properties we are looking for - shape, colour, size - are all described by words with a direct physical relation or with a simple mental association. What we are really trying to do is tidy the description into a set of suitable noun phrases.</Paragraph> <Paragraph position="2"> Although the internal details of the various systems are totally different, the final result is some form of layout, script or structure which has been filled out with details from the text. The approaches of the various groups can be contrasted according to how much of the text is preserved at this point and how much additional detail has been added by the system. DeJong (1979) processes newswire stories, and once the key elements have been found the rest of the text is abandoned. Sager makes the whole text fit into the layout, as here small details may be of vital importance to the end user of the processed text. Schank, in his story understanding programs, may actually end up with more information than the original text, supplied from the system's own world knowledge.</Paragraph> <Paragraph position="3"> The other contrasting factor is the degree of limitation of the domain of interest of the text processors. 
The more a system has been designed with a practical end in view, the more limited the domain. Schank is operating at the level of general language understanding. DeJong limits this to the task of news recognition and abstraction, but only certain stories are handled by the system. Sager has reduced the range still further to a particular type of medical diagnosis.</Paragraph> <Paragraph position="4"> Very recent work appears to be approaching text understanding from a word-oriented viewpoint. Each word has associated with it processes which drive the analysis of the text (Small, 198\[). We have also been encouraged in our own approach by Kelly and Stone's (1979) work on word disambiguation, the implication of which seems to be that word-driven rules can resolve ambiguities of meaning in a local context.</Paragraph> <Paragraph position="5"> Our own case is a purely practical attempt to generate large amounts of database building information from single topic texts. It should not be assumed, however, that a truly comprehensive syntax for a descriptive text would be simpler than for other types. The reverse may be true, and the author of the descriptions may attempt to liven up his work with asides, unusual word orders and additional atmospheric details.</Paragraph> <Paragraph position="6"> Our system does not use sophisticated grammatical techniques. It is our contention that in the domain of descriptive texts we can make certain assumptions about the way the descriptive data is handled. These allow very crude parsing to be sufficient in most cases.</Paragraph> <Paragraph position="7"> Similarly the semantic structures involved are simple. 
A description of an object consisting of several parts usually mentions the part and</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> III OUTLINE OF THE SYSTEM </SectionTitle> <Paragraph position="0"> The text analysis system has been constructed on the assumption that much of the information held in descriptive texts can be extracted using very simple rules. These rules are analogous to the &quot;sketchy syntax&quot; suggested by Kelly and Stone and operate on the text on a local rather than a global basis.</Paragraph> <Paragraph position="1"> At the time of writing our system processes plant descriptions, in search of ten properties which we consider distinctive. Examples of these properties are the size of the plant, the colour of its flowers and the shape of its flowers. New properties can be added simply by extending the skeleton plant description.</Paragraph> <Paragraph position="2"> Example I. A Sample Analysis SMALL BUGLOSS.</Paragraph> <Paragraph position="3"> An erect bristly annual, up to a foot high, with wavy lanceolate leaves and small blue flowers which are the only ones of their family to have their corolla-tube kinked at the base; calyx with lanceolate teeth, hardly enlarging but much exceeding the fruit. Habitat: Widespread and locally frequent in open spaces on light soils.</Paragraph> <Paragraph position="4"> April onwards.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> TOPIC | COMPONENT PARTS | PROPERTY NAMES | PROPERTY VALUES </SectionTitle> <Paragraph position="0"> The texts being processed are plant descriptions as found in McClintock and Fitter (1974). The system has been built to handle this topic and it attempts to fill out various properties for selected parts of a plant. A skeleton description is used to drive the processing of the text. 
This indicates the parts of the plant of interest and the properties required for each part.</Paragraph> <Paragraph position="1"> The structure which we presently use is shown in Example I after it has been filled out by processing the accompanying description. It should be noted that if the system cannot find a property then the null property &quot;noinfo&quot; is returned. An outline of how a description is processed by the system and converted to canonical form is given in Figure 1. There are four distinct stages in the transformation of the text. Each text segment has an attached keyword. This keyword identifies the text as describing a particular part of the plant. Text segments are gathered together for a particular keyword. This may pull together text from separate parts of the original description. This new unit of text is then examined to see if any of the words or phrases in it satisfy the specific property rules required for this part of the plant. If found, the phrases are inserted into appropriate parts of the structure.</Paragraph> <Paragraph position="2"> D. Formatter. The ultimate output of the system is intended as input to a relational database system developed at the University of Strathclyde. At the moment the structure is displayed in a form that allows checking of the system performance.</Paragraph> <Paragraph position="3"> A. Dictionary processor. The raw text is read in and each word in the text is checked in a dictionary/keyword list. Each dictionary entry has an associated list of attributes describing both syntactic and semantic attributes of that word. These attributes are looked at in more detail in section IV. If a word in the text appears in the dictionary it is supplemented with an attribute list abstracted from the dictionary.</Paragraph> <Paragraph position="4"> The keywords for a text depend on which parts of the object we are interested in. 
Thus for a plant we need to include all possible variants of flower (floret, bud) and of leaf (leaflet) and so on. Fortunately this is not a large number of words and they can be easily acquired from a thesaurus.</Paragraph> <Paragraph position="5"> The output from this stage is a list of words, with a list of attributes attached to each word.</Paragraph> <Paragraph position="6"> B. Text splitting.</Paragraph> <Paragraph position="7"> The expanded text is then burst into segments associated with each keyword. We identify segments by using &quot;pivotal points&quot; in the text. Pivotal points are pronouns, conjunctions, prepositions and punctuation marks. This is the simplifying assumption which allows us to avoid detailed grammars. The actual words and punctuation marks chosen to split the text are critical to the success of this method. It may be necessary to change these for texts by a different author, as each author's usage of punctuation is fairly idiosyncratic. Within a given work, however, fairly consistent results are obtained. The actual splitting of the text is covered more fully in section IV C.</Paragraph> <Paragraph position="8"> C. Text analysis.</Paragraph> <Paragraph position="9"> We now have many small segments of text</Paragraph> </Section> <Section position="7" start_page="0" end_page="121" type="metho"> <SectionTitle> IV SYSTEM DETAILS A. The Dictionary </SectionTitle> <Paragraph position="0"> The dictionary is the source of the meanings of words used during the search for properties. Two other word sources are incorporated in the system: a list of keywords which is specific to the subject being described and a list of words which may be used to split the text. 
This second list could probably be incorporated in the dictionary, but we have avoided this until the system has been generalised to handle other types of text.</Paragraph> <Paragraph position="1"> The dictionary entry for each word consists of three lists of attributes. The first contains its part of speech, a flag indicating the word carries no semantic information and some additional attributes to control processing. For example, the attribute &quot;take-next&quot; indicates that if a property rule is already satisfied when this word is reached in the text then the next word should be attached to the property phrase already found. Thus the word &quot;-&quot; carries this attribute and pulls in a successive word.</Paragraph> <Paragraph position="2"> The second list contains attributes whose meaning would appear to be expressible as a physical measure of some kind: &quot;touch-roughness&quot;, &quot;vision-intensity&quot;. Many of the words used in descriptions can be adequately categorised by a single attribute of this type. Thus the word red is an &quot;adjective&quot; with a physical property &quot;vision-colour&quot;.</Paragraph> <Paragraph position="3"> The third contains those which require physical measures to be mapped and compared to internal representations or which deal with the manipulation of internal representations alone: &quot;form-shape&quot;, &quot;context-location&quot;. Words using these attributes generally tend to be more complex and may have multiple attributes. Thus the word field has as attributes &quot;context-location&quot; and &quot;relationship-multiple-example&quot; whereas the word Scotland also carries &quot;context-location&quot; but is qualified by &quot;relationship-single-example&quot;. We realize this division is delimited by an extremely fuzzy border, but when the search for a basis for word definition was made this helped the intuitive allocation of attributes. 
Sixty-five different attributes have been allocated.</Paragraph> <Paragraph position="4"> Only sixteen of these are used in the rules for our current list of properties.</Paragraph> <Paragraph position="5"> The size of the dictionary has been considerably reduced by including the algorithm, given by Kelly and Stone (1979), for suffix removal in the lookup process.</Paragraph> <Paragraph position="6"> B. Skeleton Structure The structure we wish to fill out is mapped directly to a hierarchical PROLOG structure with the uninstantiated variables, shown in the structure in capital letters, indicating where pieces of text are required. The PROLOG system fills in these variables at run time with the appropriate words from the text. Each variable in a completed structure should hold a list of words which describe that particular property. Thus a partial plant structure is defined as: [...] This skeleton is accompanied by a set of keyword lists, each list being associated with one of the first-level items of the structure. Thus a partial list for &quot;flower&quot; might be: keyword(flower,1). keyword(bud,1).</Paragraph> <Paragraph position="7"> keyword(petal,1).</Paragraph> <Paragraph position="8"> keyword(floret,1).</Paragraph> <Paragraph position="9"> The number indicates which item on the first level of the structure is associated with these keywords.</Paragraph> <Paragraph position="10"> We assume initially that we are describing the general details of the plant, so the text read up to the first pivotal point belongs to that part of our structure, keyword level 0.</Paragraph> <Paragraph position="11"> Each subsequent piece of text found is assigned to the same keyword until a piece of text is found containing a new keyword. This becomes the current keyword and following pieces of text belong to this keyword until yet another keyword is found.</Paragraph> <Paragraph position="12"> D. 
Property Rules We now gather together the pieces of text for a part of the structure and look for properties as defined in the skeleton structure. A property search is carried out for each of the property names found at level two of the structure. The property rules have the general form: [...] The fundamental assumption we make for descriptions of objects is that the part described will be mentioned within the piece of text referring to it. Thus conjunctions and punctuation marks are taken to flag pivotal points in the text where attention shifts from one part to another. E. Special Purpose Rules We are trying to avoid rules specifically associated with layout which would need redefinition for different texts. However, the system does assume a certain ordering in the initial title of the descriptions. Thus the name of the plant is any adjectives followed by a word or words not in the dictionary. It is intended to add rules to detect the Latin specific name of the plant. We have excluded these from our current texts.</Paragraph> <Paragraph position="13"> These will in all probability be based on a similar rule of ignorance, reinforced by some knowledge of permissible suffixes.</Paragraph> <Paragraph position="14"> F. Specially Recognised Words Certain words are identified in the dictionary by the attributes &quot;take-next&quot; and &quot;take-previous&quot;. They imply that if a property rule is satisfied at the time that word is processed then the successor or predecessor of that word and the word itself should be included in the property. The principal use of this occurs in hyphenated words. These are treated as three words: word1, hyphen, word2. The hyphen carries both &quot;take-next&quot; and &quot;take-previous&quot; attributes. This often allows attachment of unknown words in a property phrase. Thus &quot;chocolate-brown&quot; would be recognised as a colour phrase despite the fact that the word chocolate is not included in the dictionary. 
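As an illustration only, the effect of the two attributes on hyphenated words can be sketched in Python rather than in the PROLOG of the actual system. The sketch assumes a two-entry dictionary and a single colour rule; the names DICTIONARY, tokenize and colour_phrases are hypothetical, and matching words first and then extending around hyphens is an assumed simplification of the processing order described above.

```python
# Hypothetical two-entry dictionary: word mapped to its attribute names.
# The attribute names follow the text; the entries themselves are assumed.
DICTIONARY = {
    "brown": {"vision-colour"},
    "-": {"take-next", "take-previous"},
}


def tokenize(text):
    """Split on spaces and treat every hyphen as a separate word."""
    return text.replace("-", " - ").split()


def colour_phrases(text):
    """Return colour phrases, letting hyphens pull in their neighbours."""
    words = tokenize(text)
    # Step 1: positions of words that satisfy the colour rule directly.
    matched = {i for i, w in enumerate(words)
               if "vision-colour" in DICTIONARY.get(w, set())}
    keep = set(matched)
    # Step 2: a word next to a match that carries take-previous or
    # take-next is kept together with its predecessor or successor.
    for i in range(1, len(words) - 1):  # interior words have both neighbours
        attrs = DICTIONARY.get(words[i], set())
        if i - 1 in matched or i + 1 in matched:
            if "take-previous" in attrs:
                keep.update({i - 1, i})
            if "take-next" in attrs:
                keep.update({i, i + 1})
    # Step 3: join runs of consecutive kept words into phrases.
    phrases, current = [], []
    for i, word in enumerate(words):
        if i in keep:
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases
```

With these assumed entries, colour_phrases("small chocolate-brown flowers") returns ["chocolate - brown"], although chocolate has no dictionary entry.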
Words which actually name the property being sought after carry a &quot;take-previous&quot; attribute. Thus &quot;coloured&quot; when found will pull in the previous word, e.g. &quot;butter coloured&quot;, although the word butter may be unknown or have no specific dictionary attribute recognised by the rule.</Paragraph> <Paragraph position="15"> In particular, we intend to provide a user interface to allow the system to be modified for a specific topic by user definitions and examples.</Paragraph> <Paragraph position="16"> The potential also exists for mapping from our word based internal representation to a more abstract machine manipulable form. This may be the most interesting direction in which the work will lead.</Paragraph> </Section> <Section position="8" start_page="121" end_page="121" type="metho"> <SectionTitle> VI IMPLEMENTATION </SectionTitle> <Paragraph position="0"> The code for the system is written in PROLOG (Clocksin and Mellish, 1981) as implemented on the Edinburgh Multi Access System (Byrd, 1981).</Paragraph> <Paragraph position="1"> This is a standard implementation of the language, with the single enhancement of a second internal database which is accessed using a hashing algorithm rather than a linear search. This has been used to improve the efficiency of the dictionary search procedures.</Paragraph> <Paragraph position="2"> PROLOG was chosen as an implementation language mainly because of the ease of manipulation of structures, lists and rules. The skeleton plant and keyword lists are held as facts in the PROLOG database. The implementation of the suffix stripping algorithm is a good example of the ease of expressing algorithms in PROLOG. 
The mapping from the original to our code is almost one to one.</Paragraph> </Section> <Section position="9" start_page="121" end_page="121" type="metho"> <SectionTitle> V FURTHER DEVELOPMENTS </SectionTitle> <Paragraph position="0"> In the short term, the size of the dictionary and the rules built into the system must be increased so that a higher proportion of descriptions are correctly processed. Another problem which we must handle is the use of qualifiers referring to previous descriptions, e.g. &quot;darker green&quot; or &quot;much less hairy than the last species&quot;. We intend to tackle this problem by merging the current canonical description with that of plants referred to previously. It would appear from work that has been carried out on dictionary analysis (Amsler, 1981) that a less intuitive method of word meaning categorization may be available. If it proves possible to map from a standard dictionary to our set of attributes or some related set then the rigour of our internal dictionary would be significantly improved and a major area of repetitive work might be removed from the system.</Paragraph> <Paragraph position="1"> It is also intended to extend the suffix algorithm to handle prefixes and to convert the part of speech attribute according to the transformations carried out on the word. This has not proved important to us up to the present but future uses of the dictionary may depend on its being handled correctly.</Paragraph> <Paragraph position="2"> In the longer term we intend to generalise the system to cope with other topic areas. In addition, the implementation on EMAS allows large PROLOG programs to be run. 
The interpretive nature of the language also means that trace debugging facilities are available and new pieces of code can be easily incorporated into the system.</Paragraph> </Section> <Section position="10" start_page="121" end_page="122" type="metho"> <SectionTitle> VII CONCLUSIONS </SectionTitle> <Paragraph position="0"> Initial indications suggest that for about 50% of descriptions, all ten properties are correctly evaluated and for about 30%, 8 or 9 properties are correct. The remaining 20% are unacceptable as fewer than 8 properties are correctly determined by the system.</Paragraph> <Paragraph position="1"> We anticipate that increasing the knowledge base of the system will significantly increase its accuracy. The very primitive &quot;sketchy syntax&quot; approach appears to offer practical solutions in analysing descriptive texts. Furthermore, there seems to be no intrinsic reason why a similar method could not be used to analyse temporal or causal structures. There will always be segments of text that the system cannot cope with, and to achieve a greater degree of accuracy we will need to allow the system to consult with the user in resolving difficult pieces of text.</Paragraph> <Paragraph position="2"> The structured nature of the system output allows the possibility of building a complex database system. A database system based on the raw text alone has no ability to distinguish to which part of an object any property belongs, as its searches are made on the basis of keywords alone without taking contextual information into account.</Paragraph> </Section> <Section position="11" start_page="122" end_page="122" type="metho"> <SectionTitle> VIII ACKNOWLEDGEMENTS </SectionTitle> <Paragraph position="0"> I would like to thank the director of the Computer Centre, Mr. Grant Fraser, for making available time to carry out this work and my supervisor Dr. 
Ian Sommerville for his help in the development of the system and in the writing of this paper.</Paragraph> </Section> </Paper>