File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/83/a83-1016_metho.xml
Size: 12,475 bytes
Last Modified: 2025-10-06 14:11:30
<?xml version="1.0" standalone="yes"?> <Paper uid="A83-1016"> <Title>UTILIZING DOMAIN-SPECIFIC INFORMATION FOR PROCESSING COMPACT TEXT</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> UTILIZING DOMAIN-SPECIFIC INFORMATION FOR PROCESSING COMPACT TEXT </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> This paper identifies the types of sentence fragments found in the text of two domains: medical records and Navy equipment status messages. The fragment types are related to full sentence forms on the basis of the elements which were regularly deleted. A breakdown of the fragment types and their distributions in the two domains is presented. An approach to reconstructing the semantic class of deleted elements in the medical records is proposed which is based on the semantic patterns recognized in the domain.</Paragraph> </Section> <Section position="3" start_page="0" end_page="99" type="metho"> <SectionTitle> I. INTRODUCTION </SectionTitle> <Paragraph position="0"> A large amount of natural language input, whether to text processing or question-answering systems, consists of shortened sentence forms, sentence "fragments". Sentence fragments are found in informal technical communications, messages, headlines, and in telegraphic communications.</Paragraph> <Paragraph position="1"> Occurrences are characterized by their brevity and informational nature. In all of these, if people are not restricted to using complete, grammatical sentences, as they are in formal writing situations, they tend to leave out the parts of the sentence which they believe the reader will be able to reconstruct. 
This is especially true if the writer deals with a specialized subject matter where the facts are to be used by others in the same field.</Paragraph> <Paragraph position="2"> Several approaches to such "ill-formed" natural language input have been followed. The LIFER system [Hendrix, 1977; Hendrix, et al., 1978] and the PLANES system [Waltz, 1978] both account for fragments in procedural terms; they do not require the user to enumerate the types of fragments which will be accepted. The Linguistic String Project has characterized the regularly occurring ungrammatical constructions and made them part of the parsing grammar [Anderson, et al., 1975; Hirschman and Sager, 1982]. Kwasny and Sondheimer (1981) have used error-handling procedures to relate the ill-formed input of sentence fragments to well-formed structures.</Paragraph> <Paragraph position="3"> While these approaches differ in the way they determine the structure of the fragments and the deleted material, for the most part they rely heavily, at some point, on the recognition of semantic word-classes. The purpose of this paper is to describe the syntactic characteristics of sentence fragments and to illustrate how the domain-specific information embodied in the co-occurrence patterns of the semantic word-classes of a domain can be utilized as a powerful tool for processing a body of compact text, i.e. text that contains a large percentage of sentence fragments.
II. IDENTIFICATION OF FRAGMENT TYPES
The New York University Linguistic String Project has developed a computer program to analyze compact text in specialized subject areas using a general parsing program and an English grammar augmented by procedures specific to the subject areas. In recent years the system has been tailored for computer analysis of free-text medical records, which are characterized by numerous sentence fragments. 
In the computer analysis and processing of the medical records, relatively few types of sentence fragments sufficed to describe the shortened forms, although such fragments comprised fully 49% of the natural language input [Marsh and Sager, 1982]. Fragment types can be related to full forms on the basis of the elements which are regularly deleted. Elements deleted from the fragments are from one or more of the syntactic positions: subject, tense, verb, object. The six fragment types identified in the set of medical records are shown in Table 1 as types I-VI.</Paragraph> <Paragraph position="4"> A feature of fragment types that is not immediately obvious is the fact that they are already known in the full grammar as parts of fuller constructions. The fragment types reflect deletions found in syntactically distinguished positions within full sentences, as illustrated in Table 2. For example, in normal English, a sentence that contains tense and the verb be can occur as the object of verbs like find (e.g. She found that the sentence was ~). In the same environment, as object of find, a reduced sentence can occur in which the tense and verb be have been omitted, as in fragment type I (e.g. She found the sentence ~lllJ;~). In the same manner, other reduced forms reflected in fragment types also represent constructions generally found as parts of regular English sentences.</Paragraph> <Paragraph position="5"> The fact that the fragment types can be related to full English forms makes it possible to view them as instances of reduced SUBJECT-VERB-OBJECT patterns from which particular components have been deleted. Fragments of type I can be represented as having a deleted tense and verb be, of type II as having a deleted subject, tense, and verb be, etc. 
This makes it relatively straightforward to add them to the parsing grammar,</Paragraph> </Section> <Section position="4" start_page="99" end_page="101" type="metho"> <SectionTitle> INFINITIVAL PREDICATE THEY TOOK THE TRAIN TO AVOID THE TRAFFIC. </SectionTitle> <Paragraph position="0"> and, at the same time, provides a framework for identifying their semantic content by relating them to the corresponding full forms.</Paragraph> <Paragraph position="1"> The number of fragment types that occur in compact text of different technical domains appears to be relatively limited. When the fragment types found in medical records were compared with those seen in a small sample of Navy equipment status messages, five of the six types found in the medical records were also found in the Navy messages. Only one additional fragment type was required to cover the Navy messages. This type appears in Table 1 as type VII, in which two subjects have been deleted (Request advise for Dick ~Q.).</Paragraph> <Paragraph position="2"> While the number of fragment types is relatively constant, the distribution of fragment types varies according to the domain of the text. Table 3 shows distributions for each of the fragment types identified in Table 1. For example, in Table 3, while fragment type IV, from which subject, tense, and verb have been deleted, is most frequent in medical records, it is a much less frequent type in the Navy messages. On the other hand, type VI, from which a subject has been deleted, is relatively infrequent in medical records, but much more frequent in Navy messages. In addition, the different sections of the input differ with respect to the ratio of fragments to whole sentences and in the types of fragments they contain. 
For example, the different sections of the medical records that were analyzed (e.g.</Paragraph> <Paragraph position="3"> HISTORY, EXAM, LAB-DATA, IMPRESSION, COURSE IN HOSPITAL) were distinguished by differences in the distribution of the fragment types. The EXAM paragraph of the medical texts, in which the physician describes the results of the patient's physical examination, contained a relatively large number of fragments of type III, especially adjective phrases. The COURSE IN HOSPITAL paragraph contained a larger number of complete sentences than the other paragraphs.</Paragraph> <Paragraph position="4"> TABLE 3. DISTRIBUTION OF FRAGMENT TYPES
TYPE    MEDICAL    NAVY
I.      22%        36%
II.     1%         6%
III.    12%        11%
IV.     61%        15%
V.      1%         0%
VI.     2%         28%
VII.    0%         4%
III. RECONSTRUCTION OF DELETIONS
The deletions which relate fragment types to their full sentence forms fall into two main classes: (i) those found in virtually all texts and (ii) those specific to the domain of the text. Just as the fragment types can be viewed as incomplete realizations of syntactic S-V-O structures, the semantic patterns in sentence fragments can be considered incomplete realizations of the semantic S-V-O patterns. In general terms, the structure of information in technical domains can be specified by a set of semantic classes, the words and phrases which belong to these classes, and by a specification of the patterns these classes enter into, i.e. the syntactic relationships among the members of the classes [Grishman, et al., 1982; Sager, 1978]. In the case of the medical sublanguage processed by the Linguistic String Project, the medical subclasses were derived through techniques of distributional analysis [Hirschman and Sager, 1982]. Semantic S-V-O patterns were then derived from the combinatory properties of the medical classes in the text [Marsh and Sager, 1982]; the semantic patterns identified in a text are specific to the domain of the text. 
While they serve to formulate sublanguage constraints which rule out incorrect syntactic analyses caused by structural or lexical ambiguity, these relationships among classes can also provide a means by which deleted elements in compact text can be reconstructed. When a fragment is recognized as an instance of a given semantic pattern, it is then possible to specify a set of the semantic classes from which the medical sublanguage class of the deleted element can be selected.</Paragraph> <Paragraph position="5"> On a superficial level, the deletions of be in fragment types Ic-f and IIIa-b, for example, can be reconstructed on purely syntactic grounds by filling in the lexical item be. However, it is also possible to provide further information and specify the semantic class of the lexical item be by reference to the semantic S-V-O pattern manifested by the occurring subject and object. For example, in type If fragment skin no eruptions, skin has the medical subclass BODYPART, and eruptions has the medical subclass SIGN/SYMPTOM. The semantic S-V-O pattern in which these classes play a part is: BODYPART-SHOWVERB-SIGN/SYMPTOM (as in Skin showed no eruptions). Be can then be assigned the semantic class SHOWVERB. protein ~, type It, enters into the semantic pattern: TEST-TESTVERB-TESTRESULT and be can be assigned the class TESTVERB, which relates a TEST subject with a TESTRESULT object. Assigning a semantic class to the reconstructed be maximizes its informational content.</Paragraph> <Paragraph position="6"> In addition to reconstructing a distinguished lexical item, like the verb be, along with its semantic classes, it is also possible to specify the set of semantic classes for a deleted element, even though a lexical item is not immediately reconstructable. For example, the fragment To receive folic acid, of Type VI, contains a verb of the PTVERB class and a MEDICATION object, but the subject has been deleted. 
The only semantic pattern which permits a verb and object with these medical subclasses is the S-V-O pattern: PATIENT-PTVERB-MEDICATION. Through recognition of the semantic pattern in which the occurring elements of the fragment play a role, the semantic class PATIENT can be specified for the deleted subject. Patient is one of the distinguished words in the domain of narrative medical records which are often not explicitly mentioned in the text, although they play a role in the semantic patterns.</Paragraph> <Paragraph position="7"> The S-V-O relations, of which the fragment types are incomplete realizations, form the basis of a procedure which specifies the semantic classes of deleted elements in fragments. Under the best conditions, the set of semantic classes for the deleted form contains only one element. It is also possible, however, for the set to contain more than one semantic class. For example, the type Ia fragment Pain also noted in hands and knees, when regularized to normal active S-V-O word order as noted pain in hands and knees, has a deleted subject. The set of possible medical classes for the deleted subject consists of {PATIENT, FAMILY, DOCTOR}, since a fragment with a verb of the OBSERVE class, such as note, and an object of the SIGN/SYMPTOM class, such as pain, can enter into</Paragraph> </Section> </Paper>
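The reconstruction procedure described in Section III can be illustrated with a minimal Python sketch. It is not the Linguistic String Project implementation: the class names are taken from the examples in the text, but the pattern inventory PATTERNS, the function name candidate_classes, and the representation of a fragment as a (subject, verb, object) triple with None marking the deleted slot are all assumptions made for illustration.

```python
# Hypothetical inventory of semantic S-V-O patterns. Only the class names
# come from the paper's examples; the list itself is an assumption.
PATTERNS = [
    ("BODYPART", "SHOWVERB", "SIGN/SYMPTOM"),
    ("TEST", "TESTVERB", "TESTRESULT"),
    ("PATIENT", "PTVERB", "MEDICATION"),
    ("PATIENT", "OBSERVE", "SIGN/SYMPTOM"),
    ("FAMILY", "OBSERVE", "SIGN/SYMPTOM"),
    ("DOCTOR", "OBSERVE", "SIGN/SYMPTOM"),
]


def candidate_classes(subject=None, verb=None, obj=None):
    """Return the set of semantic classes that may fill the one deleted slot.

    Each argument is the semantic class of an occurring element, or None
    for the deleted element. Exactly one argument must be None.
    """
    observed = (subject, verb, obj)
    missing = [i for i, cls in enumerate(observed) if cls is None]
    if len(missing) != 1:
        raise ValueError("exactly one of subject, verb, obj must be None")
    slot = missing[0]
    # Keep every pattern that agrees with the occurring elements and
    # collect the class it places in the deleted position.
    return {
        pattern[slot]
        for pattern in PATTERNS
        if all(o is None or o == p for o, p in zip(observed, pattern))
    }


# "skin no eruptions": the deleted be receives a verb class.
print(candidate_classes(subject="BODYPART", obj="SIGN/SYMPTOM"))
# "To receive folic acid": the deleted subject receives a single class.
print(candidate_classes(verb="PTVERB", obj="MEDICATION"))
# "Pain also noted in hands and knees": several classes remain possible.
print(candidate_classes(verb="OBSERVE", obj="SIGN/SYMPTOM"))
```

Under these assumed patterns the first two calls yield a single class each (SHOWVERB and PATIENT), while the third yields the three-member set PATIENT, FAMILY, DOCTOR, mirroring the ambiguous case discussed in the text.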