File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/p92-1003_metho.xml
Size: 17,042 bytes
Last Modified: 2025-10-06 14:13:13
<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1003"> <Title>A SIMPLE BUT USEFUL APPROACH TO CONJUNCT IDENTIFICATION 1</Title> <Section position="3" start_page="0" end_page="15" type="metho"> <SectionTitle> THE RESEARCH PROJECT </SectionTitle> <Paragraph position="0"> This research on conjunct identification is part of a larger research project that explores the automated extraction of information from structured reference manuals.</Paragraph> <Paragraph position="1"> The largest manual available to the project in machine-readable form is the Merck Veterinary Manual, which serves as the primary testbed.</Paragraph> <Paragraph position="2"> The system semi-automatically builds and updates its knowledge base. There are two components to the system - an NLP (natural language processing) component and a knowledge analysis component. (See Figure 4 at the end.) The NLP component consists of a tagger, a semi-parser, a prepositional phrase attachment specialist, a conjunct identifier for coordinate conjunctions, and a restructurer. The tagger is a probabilistic program that tags the words in the manual. These tags consist of two parts - a mandatory syntactic portion and an optional semantic portion. For example: the word 'cancer' would be tagged as noun//disorder, the word 'characterized' would be verb/past_p, etc. The semantic portion of the tags provides domain-specific information. The semi-parser, which is not a full-blown parser, is responsible for identifying noun, verb, prepositional, gerund, adjective, and infinitive phrases in the sentences. Any word not captured as one of these is left as a solitary 'word' at the top level of the sentence structure. 
The output produced by the semi-parser has very little embedding and consists of very simple structures, as will be seen below.</Paragraph> <Paragraph position="3"> The prepositional phrase attachment disambiguator and the conjunct identifier for coordinate conjunctions are considered to be &quot;specialist&quot; programs that work on these simple structures and manipulate them into more deeply embedded structures. More such specialist programs are envisioned for the future. The restructurer is responsible for taking the results of these specialist programs and generating a deeper structure of the sentence. These deeper structures are passed on to the knowledge analysis component.</Paragraph> <Paragraph position="4"> The knowledge analysis component is responsible for extracting from these structures several kinds of objects and relationships to build and update an object-oriented knowledge base.</Paragraph> <Paragraph position="5"> The system can then be queried about the information contained in the text of the manual. This paper primarily discusses the conjunct identifier for coordinate conjunctions. Detailed information about the other components of the system can be found in [Hodges et al., 1991], [Boggess et al., 1991], [Agarwal, 1990], and [Davis, 1990].</Paragraph> </Section> <Section position="4" start_page="15" end_page="15" type="metho"> <SectionTitle> CONJUNCT IDENTIFICATION </SectionTitle> <Paragraph position="0"> The program assigns a case label to every noun phrase in the sentence, depending on the role that it fulfills in the sentence. A large proportion of the nouns of the text have semantic labels; for the most part, the case label of a noun phrase is the label associated with the head noun of the noun phrase. In some instances, a preceding adjective influences the case label of the noun phrase, as, for example, when an adjective with a semantic label precedes a generic noun. A number of the resulting case labels for noun phrases (e.g. 
time, location, etc.) are similar to those suggested by Fillmore [1972], but domain-dependent case labels (e.g. disorder, patient, etc.) have also been introduced. For example: the noun phrase &quot;a generalized dermatitis&quot; is assigned a case label of disorder, while &quot;the ear canal&quot; is given a case label of body_part. It should be noted that, while the coordination algorithm assumes the presence of semantic case labels for noun phrases, based on semantic tags for the text, it does not depend on the specific values of these labels, which change from domain to domain.</Paragraph> </Section> <Section position="5" start_page="15" end_page="18" type="metho"> <SectionTitle> THE ALGORITHM </SectionTitle> <Paragraph position="0"> The algorithm makes the simplifying assumption that each coordinate conjunction conjoins only two conjuncts. One of these appears shortly after the conjunction and is called the post-conjunct, while the other appears earlier in the sentence and is referred to as the pre-conjunct.</Paragraph> <Paragraph position="1"> The identification of the post-conjunct is fairly straightforward: the first complete phrase that follows the coordinate conjunction is presumed to be the post-conjunct. This has been found to work in all of the sentences on which this algorithm has been tested. The identification of the pre-conjunct is somewhat more complicated. There are three different levels of rules that are tried in order to find the matching pre-conjunct. These are referred to as level-1, level-2, and level-3 rules in decreasing order of importance. 
The steps involved in the identification of the pre- and the post-conjunct are described below.</Paragraph> <Paragraph position="2"> (a) The sentential components (phrases or single words not grouped into a phrase by the parser) are pushed onto a stack until a coordinate conjunction is encountered.</Paragraph> <Paragraph position="3"> (b) When a coordinate conjunction is encountered, the post-conjunct is taken to be the immediately following phrase, and its type (noun phrase, prepositional phrase, etc.) and case label are noted.</Paragraph> <Paragraph position="4"> (c) Components are popped off the stack, one at a time, and their types and case labels are compared with those of the post-conjunct. For each component that is popped, the rules at level-1 and level-2 are tried first. If both the type and case label of a popped component match those of the post-conjunct (level-1 rule), then this component is taken to be the pre-conjunct.</Paragraph> <Paragraph position="5"> Otherwise, if the type of the popped component is the same as that of the post-conjunct and the case label is compatible with that of the post-conjunct (level-2 rule), then this component is identified as the pre-conjunct. Case labels like medication and treatment, which are semantically similar, are considered to be compatible. If the popped component satisfies neither of these rules, then another component is popped from the stack and the level-1 and level-2 rules are tried for that component.</Paragraph> <Paragraph position="6">
sentence([
noun_phrase(case_label(body_part), [(the, det), (ear, noun//body_part)])
verb_phrase([(should, aux), (be, aux), (cleaned, verb/past_p)])
prep_phrase([(by, prep), gerund_phrase([(flushing, verb/gerund)])])
word([(away, adv//location)])
noun_phrase(case_label(unknown), [(the, det), (debris, noun)])
word([(and, conj/co_ord)])
noun_phrase(case_label(body_fluid), [(exudate, noun//body_fluid)])
gerund_phrase([(using, verb/gerund), noun_phrase(case_label(medication), [(warm, adj), (saline, adj//medication), (solution, noun//medication)])])
word([(or, conj/co_ord)])
noun_phrase(case_label(unknown), [(water, noun)])
prep_phrase([(with, prep), noun_phrase(case_label(medication), [(a, det), (very, adv/degree), (dilute, adj//degree), (germicidal, adj//medical), (detergent, noun//medication)])])
word([(comma, punc)])
word([(and, conj/co_ord)])
noun_phrase(case_label(body_part), [(the, det), (canal, noun//body_part)])
verb_phrase([(dried, verb/past_p)])
word([(as, conj/correlative)])
word([(gently, adv)])
word([(as, conj/correlative)])
adj_phrase([(possible, adj)])
]).
Figure 1</Paragraph> <Paragraph position="7"> (d) If no component is found that satisfies the level-1 or level-2 rules and the beginning of the sentence is reached (popping components off the stack moves backwards through the sentence), then the requirement that the case label be either the same or compatible is relaxed. The component with the same type as that of the post-conjunct (irrespective of the case label) that is closest to the coordinate conjunction is identified as the pre-conjunct (level-3 rule). (e) If a pre-conjunct is still not found, then the post-conjunct is conjoined to the first word in the sentence.</Paragraph> <Paragraph position="8"> Although there is very little embedding of phrases in the structures provided by the semi-parser, noun phrases may be embedded in prepositional phrases, infinitive phrases, and gerund phrases on the stack. 
The algorithm does permit noun phrases that are post-conjuncts to be conjoined with noun phrases embedded as objects of, say, a previous prepositional phrase (e.g., in the sentence fragment &quot;in dogs and cats&quot;, the noun phrase 'cats' is conjoined with the noun phrase 'dogs', which is embedded as the object of the prepositional phrase 'in dogs'), or other similar phrases.</Paragraph> <Paragraph position="9"> We have observed empirically that, at least for this fairly carefully written and edited manual, long-distance conjuncts have a strong tendency to exhibit high degrees of parallelism. Hence, conjuncts that are physically adjacent may merely be of the same syntactic type (or may even be syntactically dissimilar); as the distance between conjuncts increases, the degree of parallelism tends to increase, so that conjuncts are highly likely to be of the same semantic category, and syntactic and even lexical repetitions are to be found (e.g., on those occasions when a post-conjunct is to be associated with a prepositional phrase that occurs 30 words earlier, the preposition may well be repeated). 
The gist of the algorithm, then, is as follows: to look for sentential components with the same syntactic and semantic categories as the post-conjunct, first nearby and then with increasing distance toward the beginning of the sentence; failing to find such, to look for the same syntactic category, first close at hand and then with increasing distance; and, if all else fails, to default to the beginning of the sentence as the pre-conjunct (the semi-parser does not recognize clauses as such, and there may be no parallelism of any kind between the beginnings of coordinated clauses).</Paragraph> <Paragraph position="11">
sentence([
prep_phrase([(with, prep), noun_phrase([(persistent, adj//time), (or, conj/co_ord), (untreated, adj), (otitis_externa, noun//disorder)])])
word([(comma, punc)])
noun_phrase([(the, det), (epithelium, noun)])
prep_phrase([(of, prep), noun_phrase([(the, det), (ear, noun//body_part),
Figure 2</Paragraph> <Paragraph position="12"> Provisions must be made for certain kinds of parallelism which on the surface appear to be syntactically dissimilar - for example, the near-equivalence of noun and gerund phrases. In the text used as a testbed, gerund phrases are freely coordinated with noun phrases in virtually all contexts. Our probabilistic labelling system is currently being revised to allow the semantic categories for nouns to be associated with gerunds, but at the time this experiment was conducted, gerund phrases were recognized as conjuncts with nouns only on syntactic grounds, a relatively weak criterion for the algorithm.</Paragraph> <Paragraph position="13"> Further, there are instances in the text where prepositional phrases are conjoined with adjectives or adverbs - the results reported here do not incorporate provisions for such. 
Consider the sentence &quot;The ear should be cleaned by flushing away the debris and exudate using warm saline solution or water with a very dilute germicidal detergent, and the canal dried as gently as possible.&quot; The semi-parser produces the structure shown in Figure 1. The second 'and' conjoins the entire clause preceding it with the clause that follows it in the sentence. Although the algorithm does not identify clause conjuncts, it does identify the beginnings of the two clauses, &quot;the ear&quot; and &quot;the canal&quot;, as the pre- and post-conjuncts, in spite of several intervening noun phrases. This is possible because the case labels of both these noun phrases agree (they are both body_part).</Paragraph> </Section> <Section position="6" start_page="18" end_page="18" type="metho"> <SectionTitle> THE DRAWBACKS </SectionTitle> <Paragraph position="0"> Before reporting the results of an implementation of the algorithm on a 10,000-word chapter of the Merck Veterinary Manual, we describe some of the drawbacks of the current implementation.</Paragraph> <Paragraph position="1"> (i) The algorithm assumes that a coordinate conjunction conjoins only two conjuncts in a sentence. This assumption is often incorrect. If a construct like [A, B, C, and D] appears in a sentence, the coordinate conjunction 'and' frequently, but not always, conjoins all four components. (B, for example, could be parenthetical.) The implemented algorithm looks for only two conjuncts and produces a structure like [A, B, [and [C, D]]], which is counted as correct for purposes of reporting error rates below. 
Our &quot;coordinate conjunction specialist&quot; needs to work very closely with a &quot;comma specialist&quot; - an as-yet undeveloped program responsible for, among other things, identifying parallelism in components separated by commas.</Paragraph> <Paragraph position="2"> (ii) The current semi-parser recognizes certain simple phrases only and is unable to recognize clause boundaries. For the conjunct identifier, this means that it becomes impossible to identify two clauses with appropriate extents as conjuncts. The conjunct identifier has, however, been written in such a way that whenever a &quot;clause specialist&quot; is developed, the final structure produced should be correct.</Paragraph> <Paragraph position="3"> Therefore, the conjunct identifier was held responsible for correctly recognizing only the beginnings of the clauses that are being conjoined.</Paragraph> <Paragraph position="4"> Similarly, for phrases not explicitly recognized by the semi-parser, the current conjunct specialist is expected only to conjoin the beginnings of the phrases - not to somehow bound the extents of the phrases. Consider the sentence &quot;With persistent or untreated otitis externa, the epithelium of the ear canal undergoes hypertrophy and becomes fibroplastic.&quot; The structure received by the coordination specialist from the semi-parser is shown in Figure 2. In this sentence, the components &quot;undergoes hypertrophy&quot; and &quot;becomes fibroplastic&quot; are conjoined by the coordinate conjunction 'and'. The conjunct identifier only recognizes the verb phrases &quot;undergoes&quot; and &quot;becomes&quot; as the pre- and post-conjuncts respectively and is not expected to realize that the noun phrases following the verb phrases are objects of these verb phrases.</Paragraph> <Paragraph position="5">
sentence([
noun_phrase([(antibacterial, adj//medication), (drugs, noun/plural//medication)])
verb_phrase([(administered, verb/past_p)])
prep_phrase([(in, prep), noun_phrase([(the, det), (feed, noun)])])
verb_phrase([(appeared, verb/beverb)])
inf_phrase([(to, infinitive), verb_phrase([(be, verb/beverb)]), adj_phrase([(effective, adj)])])
prep_phrase([(in, prep), noun_phrase([(some, adj//quantity), (herds, noun/plural//patient)])])
word([(and, conj/co_ord)])
prep_phrase([(without, prep), noun_phrase([(benefit, noun)])])
prep_phrase([(in, prep), noun_phrase([(others, pro/plural)])])
]).
Figure 3</Paragraph> <Paragraph position="6"> (iii) Although it is generally true that the components to be conjoined should be of the same type (noun phrase, infinitive phrase, etc.), some cases of mixed coordination exist. The current algorithm allows for the mixing of only gerund and noun phrases. Consider the sentence &quot;Antibacterial drugs administered in the feed appeared to be effective in some herds and without benefit in others.&quot; The structure that the coordination specialist receives from the semi-parser is shown in Figure 3. 
Note that the prepositional phrases are eventually attached to their appropriate components, so that the phrase &quot;in some herds&quot; ultimately is attached to the adjective &quot;effective&quot;. The system does not include any rule for the conjoining of prepositional phrases with adjectival or adverbial phrases. Hence the phrases &quot;effective in some herds&quot; and &quot;without benefit in others&quot; were not conjoined.</Paragraph> </Section> </Paper>
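The level-1, level-2, and level-3 rules of steps (a) through (e) in THE ALGORITHM can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names Component, COMPATIBLE, labels_compatible, and find_conjuncts are hypothetical; components are treated as a flat list (noun phrases embedded in prepositional or gerund phrases are not searched); the mixing of gerund and noun phrases is omitted; an unknown case label is assumed never to satisfy the level-1 or level-2 rule; and the conjunction is assumed not to be the first or last component.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Component:
    """One top-level item from the semi-parser (flat; embedding is ignored)."""
    kind: str                         # e.g. "noun_phrase", "verb_phrase", "word"
    text: str
    case_label: Optional[str] = None  # None stands for "unknown" or no label


# Hypothetical compatibility table; the text names only medication/treatment.
COMPATIBLE = {frozenset({"medication", "treatment"})}


def labels_compatible(a: Optional[str], b: Optional[str]) -> bool:
    """True when two known case labels are listed as semantically similar."""
    return a is not None and b is not None and frozenset({a, b}) in COMPATIBLE


def find_conjuncts(components: List[Component],
                   conj_index: int) -> Tuple[Component, Component, int]:
    """Return (pre_conjunct, post_conjunct, level) for one conjunction.

    level is 1, 2, or 3 for the rule that fired, or 0 for the step (e) default.
    """
    post = components[conj_index + 1]   # step (b): phrase right after the conjunction
    stack = components[:conj_index]     # step (a): everything pushed before it

    # Step (c): pop backwards, trying level-1 then level-2 on each component.
    for cand in reversed(stack):
        if cand.kind != post.kind:
            continue
        if cand.case_label is not None and cand.case_label == post.case_label:
            return cand, post, 1
        if labels_compatible(cand.case_label, post.case_label):
            return cand, post, 2

    # Step (d): relax the case-label requirement; the closest same-type component wins.
    for cand in reversed(stack):
        if cand.kind == post.kind:
            return cand, post, 3

    # Step (e): default to the first component of the sentence.
    return stack[0], post, 0


NP, VP, PP, GP, W = "noun_phrase", "verb_phrase", "prep_phrase", "gerund_phrase", "word"

# Top-level components of the Figure 1 sentence, up to "the canal dried".
FIGURE_1 = [
    Component(NP, "the ear", "body_part"),
    Component(VP, "should be cleaned"),
    Component(PP, "by flushing"),
    Component(W, "away"),
    Component(NP, "the debris"),
    Component(W, "and"),
    Component(NP, "exudate", "body_fluid"),
    Component(GP, "using warm saline solution"),
    Component(W, "or"),
    Component(NP, "water"),
    Component(PP, "with a very dilute germicidal detergent"),
    Component(W, ","),
    Component(W, "and"),
    Component(NP, "the canal", "body_part"),
    Component(VP, "dried"),
]
```

With the sentence encoded this way, find_conjuncts(FIGURE_1, 12) handles the second 'and' and pairs "the ear" with "the canal" under the level-1 rule, skipping the intervening noun phrases as the text describes. Scanning the stack twice keeps the level-3 fallback separate from the level-1 and level-2 pass, which mirrors the order of steps (c) and (d).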