<?xml version="1.0" standalone="yes"?> <Paper uid="J79-1036"> <Title>Grammatical Compression in Notes and Records: Analysis and Computation Barbara B. Anderson, Irwin D. J. Bross, and</Title> <Section position="15" start_page="116" end_page="116" type="evalu"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> Linguistic mechanisms of compression are used when making notes within a context where the objects and meanings are known. Mechanisms of compression in medical records for a collaborative study of breast cancer are described. The syntactic devices were mainly deletion of words having a special status in the grammar of the whole language and deletion in particular positions of words having a special status in the sublanguage. The deleted forms are described and sublanguage word classes defined. A subcorpus of the medical records was parsed by an existing computer parsing system; a component covering the deletion-forms was added to the grammar. Modifications to the computer grammar are discussed and the parsing results are summarized.</Paragraph> <Paragraph position="1"> Introduction All languages have mechanisms of compression.</Paragraph> <Paragraph position="2"> Sentences may be embedded within other sentences by means of nominalization and complementation.</Paragraph> <Paragraph position="3"> Various grammatical transformations involve deletion of certain parts of the sentence. In medical records, we find entries such as no evidence of metastases, which may be said to be derived from something like There is no evidence of metastases. Such incomplete sentences are not common in the spoken language of the medical records (i.e. dictated reports). 
However, when physicians themselves are required to write material for records, compression mechanisms are commonly used.</Paragraph> <Paragraph position="4"> Although this paper will deal with a specific corpus, similar devices would often be used for compression in other situations where there is pressure to write as little as possible. Legal, educational, and scientific records where informal notes are kept would be other examples of this class of situations. The original motivation for this study was to develop effective methods for storing the information in a medical record and to be able to retrieve this information for purposes of research, medical care, or administration. From previous research, the feasibility of verbatim input of dictated narrative has been established. Computerized extraction of the information has been shown to be feasible in a test system ACORN (Automated Coding of Report Narrative). This system has been described in detail in a series of previous papers (refs. 1-3). For a highly structured medical record where the entries are single words or very restricted sentences, the feasibility of computer-assisted editing and coding has also been established. A procedure for typing in the entries verbatim in a medical record, called 'TICES' (Type-In Coding and Editing System), has been reported elsewhere. However, the third, intermediate class of material cannot be handled by ACORN or by TICES. Therefore, a linguistic analysis of this type of material has been undertaken with the ultimate objective of setting up a comprehensive computer system that can handle almost everything in the medical records.</Paragraph> <Paragraph position="5"> In the earlier efforts to develop natural language technology, the work was facilitated by the fact that the documents involved were strictly for the transmission of factual information. Such documents are regarded as important both by the persons who are filling them out and by the persons who read them. 
In this no-nonsense situation where the record may be critically reviewed by the peers of the person who is reporting the information, unambiguous and informative transmission of information is a critical need. Some of the simplicities in the present analysis may be peculiar to this type of situation.</Paragraph> <Paragraph position="6"> The existence of a subculture with shared training, objectives, and experience may facilitate the note-taking process in somewhat the same way that a person taking notes for himself can somehow be more concise without ambiguity.</Paragraph> <Paragraph position="7"> However, many other note-taking situations would involve a subculture, though not necessarily a medical one, and the findings here might be expected to have some general applicability.</Paragraph> <Paragraph position="8"> Source of Material The medical notes discussed here are from the records of the Surgical Adjuvant Breast Project, a nationwide collaborative study involving 36 medical institutions. The records were filled out by medical and paramedical personnel at the participating institutions and centralized at Roswell Park Memorial Institute in a Statistical unit under the direction of Dr. Nelson Slack. A sample of approximately 50 was taken from the 2734 case histories of patients in the program and is being used in the linguistic analysis.</Paragraph> <Paragraph position="9"> Each case history ordinarily consists of 3-6 pages of detailed information on the patient's initial status, treatment, pathology report, medical problems, and subsequent fate. When the structured information in the record was excluded, each case history had between 6 and 26 notated items, each item consisting of 1 to 5 partial sentences. 
While this material is specialized to the purposes of the collaborative study, this type of information is fairly typical of what is found in the usual hospital record.</Paragraph> <Paragraph position="10"> The notes were typed verbatim using an IBM Mag Card Communicator so as to obtain simultaneously a typed paper document and a record in computer-usable form. This device is used in the data-input system of TICES, an existing system for handling completely structured records. It would presumably be used in any extension of TICES which would handle medical notes. In this analysis the computer was used to reorganize the material in a form more convenient for manual analysis by the linguist.</Paragraph> <Paragraph position="11"> Anderson analyzed the linguistic structure of the entries in a sample of the medical records involving radiation findings. A discussion of this analysis will take up the next part of the paper. Sager and associates used some of the findings from this study to develop methods for processing these same medical records by computer, adapting a program and grammar which had been developed for parsing science articles. This project will be discussed in the final part of the paper.</Paragraph> <Paragraph position="12"> Linguistic Characteristics of Medical Notes Many of the entries on the medical records are in the form of notes which are neither complete sentences nor single word entries, but linguistic strings of an intermediate type, which we will hereafter call fragments. Fragments are a compressed type of linguistic material resulting from various transformations which have the effect of making linguistic strings shorter by reducing or deleting material. The writer of these stretches of material must make his entries brief, in order to save time and effort, but also make them informative and unambiguous. For this reason the deleted material has to be easily recoverable, or in other words it must not contain much information. 
An analysis of the fragments shows that deletion is mainly of a small class of sentence parts: (1) tense and the verb be (t - be); (2) subject, tense and the verb be; (3) the subject; and (4) subject, tense, and verb (V) other than be.</Paragraph> <Paragraph position="13"> A second characteristic of fragments which makes deleted material recoverable is that both the deleted material and the remainders consist of words in easily defined subclasses, based on both distributional and semantic criteria. These subclasses are easily defined because of the nature of the sublanguage; in general the vocabulary is limited and each word has a limited semantic range. The question on a form which is being answered can also be used as a basis for restoring deleted material.</Paragraph> <Paragraph position="14"> One of the most commonly deleted items in the medical records is t - be (1 and 2). Tense is perhaps the most important information that the verb be gives.</Paragraph> <Paragraph position="15"> The deletion of tense in the medical records causes no ambiguity because usually the physician describes the situation at the time of filling out the report. Otherwise he gives the time in a time phrase: x-rays on November 2.</Paragraph> <Section position="1" start_page="116" end_page="116" type="sub_section"> <SectionTitle> Fragment Types </SectionTitle> <Paragraph position="0"> In Table 1 we list the fragment types, giving an example of each, but not with all occurring word subclasses.</Paragraph> <Paragraph position="1"> The types will first be given according to what material is deleted and then will be further subclassed according to the highest nodes of the tree structure of the remainder. The material in brackets is the word subclasses which are assumed to have been deleted. 
The word subclasses should have three characteristics: (1) they should enable deleted material to be recovered, (2) they should make it possible to extract and store informational units such as those in ACORN, and (3) they should be defined so that a linguistically unsophisticated person can easily put words into their subclasses.</Paragraph> <Paragraph position="2"> The word subclasses are based on both semantic and distributional criteria. To a large extent nouns can conveniently be subclassed on a semantic basis and verbs can be subclassed on a distributional basis, according to the subclasses of nouns which they take as subject and object. Due to the nature of the sublanguage there is relatively little overlap (e.g., a given verb is likely to take only one noun subclass as subject) compared to what we would find in the language as a whole.</Paragraph> <Paragraph position="3"> Two important subclasses of human nouns used in the medical records are N-physician and N-patient. Each has only a few members, but is important because many verbs characteristically take it as subject or object, and also because both, but particularly N-physician, are usually deleted. It is on the basis of the verbs which characteristically take them as subject or object that they can usually be recovered without ambiguity.</Paragraph> <Paragraph position="4"> Other noun subclasses concern more directly the subject matter of the reports, the concrete objects with which the physician is dealing. Unlike N-physician and N-patient, these classes usually have many members and they are seldom deleted. 
As with N-physician and N-patient, certain verb subclasses characteristically take them as subject or object.</Paragraph> <Paragraph position="5"> abdomen, axilla, bone, breast, cervix, pelvis; change, elevation, enlargement, gain, increase; pressure, rate, rhythm, size, weight; carcinoma, cough, disease, edema, fibrosis; biopsy, exam, film, mammogram, scan, x-ray; area, field, floor, lobe, neck, part, region; she, her, patient, lady, woman; doctor, he, him, his, I, M.D., radiologist; drug, insulin, medication, medicine, radiation; date, month, time, visit, winter, year; appear, feel, indicate, remain, represent, seem; alter, clear, change, enlarge, heal, progress; detect, find, identify, note, observe, see; admit, give, leave, place, readmit, see, transfer, treat; complain, come, cooperate, enter, feel, gain, go, have, refuse, show, suffer, take; feel, have, place, tell, transfer, treat, see; show, demonstrate, indicate, reveal, suggest; axillary, bony, clavicular, lumbar, pelvic; elevated, enlarged, healed, stable, unchanged; considerable, extensive, intermittent, little; absent, evident, known, possible, present; active, bad, benign, degenerative, firm, hard, malignant, metastatic, nodular; adjoining, distal, dorsal, frontal, left; clear, free, healthy, negative, normal. Computer Parsing of Medical Records To test the linguistic analysis, a subset of the manually analyzed corpus of medical records was parsed by computer, using the NYU Linguistic String Parser. The LSP grammar of English is based on the same linguistic principles as the ACORN grammar. Hence it could also serve to test the feasibility of adding a note-handling capability to the ACORN-TICES system. 
The LSP system, which was designed for text processing, was adapted to the parsing of medical records by deleting portions of the grammar which are not required for this type of material and adding a section covering sentence fragments.</Paragraph> <Paragraph position="7"> These changes are described below, followed by the parsing results.</Paragraph> <Paragraph position="8"> The corpus which was parsed consisted of 12 sections of the Radiation Findings extracted in their order of appearance from the medical records. These sections contained 245 sentences or sentence fragments (word sequences ending in a period). Of these, 37 were complete English sentences and 205 were fragments; 3 were combinations of both types. 21 entries were identical to others in the corpus, accounting in all for 139 of the sentences or sentence fragments. Of the complete sentences, some were quite long, e.g., Reexamination shows some scarring and thickening over the right apex which is perhaps slightly more evident than it was before, but nothing is seen that is typical of tumor involvement. Typical shorter sentences are Chest films on 10-25-68 and 12-14-68 do not show any essential changes since last reports and Liver scan 1-29-69 was normal. Fragments were, as predicted, of the types listed in Table 1, above, though not all types were represented in the parsed corpus.</Paragraph> <Paragraph position="9"> Table 3 shows the new definitions or redefinitions which were added to the LSP grammar to cover fragments. These definitions are written in Backus-Naur Form (BNF), as are all the ca. 180 definitions which comprise the context-free part of the LSP English grammar. The BNF definitions are used by the parser to construct a tree representing the structure of the input sentence. In addition to BNF definitions, the grammar contains restrictions, which test the sentence trees for grammatical and selectional well-formedness. (For more explanation of the LSP system and grammar, see N. Sager and R. Grishman, &quot;The Restriction Language for Computer Grammars of Natural Language,&quot; Commun. of the ACM, 18, 390-400, 1975, and the references cited there.) The starting, or root, definition of the grammar is SENTENCE, so this is the first definition seen in Table 3. In the case of medical records, the unit may be longer than one sentence, but we have retained the root-word SENTENCE and defined SENTENCE in this case to be a TEXTLET (definition 2), which consists of a sentence (called OLD-SENTENCE, definition 3) optionally followed by more sentences (MORESENT, definition 4). The definition of OLD-SENTENCE has the same three elements (INTRODUCER, CENTER, ENDMARK) that the definition of SENTENCE does in the LSP grammar; however, in this case, the INTRODUCER (definition 5) is NULL; the CENTER (definition 6) contains an option FRAGMENT in addition to the options ASSERTION and IMPERATIVE defined in the English grammar (other options of CENTER, e.g. QUESTION, have been deleted); and the ENDMARK (definition 10) contains unconventional punctuation, such as dashes and commas, in addition to the period and semicolon. Since our main interest here is in FRAGMENT (definition 7), we will elaborate on this definition.</Paragraph> <Paragraph position="10"> In defining FRAGMENT, we have used parts of the grammar which were defined independently of the fragment problem. That this is possible is in itself a partial verification of the conclusion from manual analysis that only limited, grammatically specifiable, deletion-forms occur in the fragments seen in notes and records. For example, the dropping of the verb (type 1 of Table 1) can occur in normal English when a sentence containing the verb be occurs as the object of a verb like find, e.g. We found the chest clear to percussion and auscultation. In the LSP grammar there is an object string defined for such occurrences; it is called SOBJBE (Subject + Object of be). 
This same string can then be made an option of CENTER to analyze fragments having the same form, e.g.</Paragraph> <Paragraph position="11"> Chest clear to percussion and auscultation.</Paragraph> <Paragraph position="12"> In detail, the definition of FRAGMENT begins with the element SA (Sentence Adjunct). The definition of SA (not shown here) contains 16 options covering all types of sentence modifiers. In this material the most frequent SA is a time expression, usually a date (called PDATE, for optional Preposition + date) or this examination, this visit. Following SA in the definition of FRAGMENT are the options proper, naming definitions already in the LSP grammar. The first option, SOBJBESHOW (Subject + Object of be or show), corresponds to the second and third structures of type 1 and also occurrences like Chest film no change, which is an expansion of SOBJBE, discussed above. This option covers deletions of the two most common verbs in this material, be and show. The place of be or show (definition 8) in a fragment is either empty or is filled by a dash. The second and fourth options, ASTG and VENPASS, in FRAGMENT correspond to structures of type 2 in Table 1 (e.g., Negative, felt to be a benign lesion), where the subject, tense and verb be have been dropped. In the LSP grammar, ASTG (Adjective string) is an option of OBJBE, and VENPASS (V-en passive string) is also permitted after be and in other places. The third option, NSTG (Noun string), is an object of show, e.g., Mild degenerative changes (from X-rays show mild degenerative changes). It also covers occurrences of the first structure of type 1 (e.g. No X-rays taken) where for regularity with more complete entries the passive verb (taken) is seen as a right adjunct of the noun. The last option, consisting of NSTG followed by either ASSERTION or SOBJBESHOW, covers such occurrences as PA and lateral chest 11-5-71 reexamination shows some scarring and thickening over the right apex, 
where a noun phrase (PA and lateral chest 11-5-71) precedes an assertion about that noun phrase.</Paragraph> <Paragraph position="13"> Space permits only a few remarks about these definitions. It was helpful to order the options so that the longer options precede the shorter ones, since some of the shorter options (e.g., NSTG) can have the same form as the first element of the longer ones. This is not required in parsing texts, since in full sentences there is usually no other way of fitting in the remainder of the sentence. Also, in text sentences, many nouns require a preceding determiner, so that compound nouns are not split into separate noun phrases.</Paragraph> <Paragraph position="14"> In this material, determiners are rarely employed, so this constraint cannot be applied. This, combined with verb deletions and the use of commas both in the text and as sentence separators, makes for a great deal of syntactic ambiguity. However, as the next section shows, it was possible to obtain the intended parse as the first output in most cases. This was accomplished without using the subclasses special to the medical material, which are used in a subsequent stage of processing preparatory to information retrieval.</Paragraph> </Section> </Section> </Paper>