<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1075">
  <Title>(ALMOST) AUTOMATIC SEMANTIC FEATURE EXTRACTION FROM TECHNICAL TEXT</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
(ALMOST) AUTOMATIC SEMANTIC FEATURE
EXTRACTION FROM TECHNICAL TEXT
Rajeev Agarwal*
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> Acquisition of semantic information is necessary for proper understanding of natural language text. Such information is often domain-specific in nature and must be acquired from the domain. This causes a problem whenever a natural language processing (NLP) system is moved from one domain to another. The portability of an NLP system can be improved if these semantic features can be acquired with limited human intervention. This paper proposes an approach towards (almost) automatic semantic feature extraction.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Acquisition of semantic information is necessary for proper understanding of natural language text. Such information is often domain-specific in nature and must be acquired from the domain. When an NLP system is moved from one domain to another, usually a substantial amount of time has to be spent in tailoring the system to the new domain. Most of this time is spent on acquiring the semantic features specific to that domain. It is important to automate the process of acquisition of semantic information as much as possible, and facilitate whatever human intervention is absolutely necessary. Portability of NLP systems has been of concern to researchers for some time [8, 5, 11, 9]. This paper proposes an approach to obtain the domain-dependent semantic features of any given domain in a domain-independent manner.</Paragraph>
    <Paragraph position="1"> The next section will describe an existing NLP system (KUDZU) which has been developed at Mississippi State University. Section 3 will then present the motivation behind the automatic acquisition of the semantic features of a domain, and a brief outline of the methodology proposed to do it. Section 4 will describe the different steps in this methodology in detail. Section 5 will focus on the applications of the semantic features. Section 6 compares the proposed approach to similar research efforts. The last section presents some final comments.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="378" type="metho">
    <SectionTitle>
2. THE EXISTING KUDZU SYSTEM
</SectionTitle>
    <Paragraph position="0"> The research described in this paper is part of a larger on-going project called the KUDZU (Knowledge Under Development from Zero Understanding) project. This project is aimed at exploring the automation of extraction of information from technical texts. The KUDZU system has two primary components -- an NLP component, and a Knowledge Analysis (KA) component. This section describes this system in order to facilitate understanding of the approach described in this paper.</Paragraph>
    <Paragraph position="1"> The NLP component consists of a tagger, a semi-parser, a prepositional phrase attachment specialist, a conjunct identifier for coordinate conjunctions, and a restructurer. The tagger is an n-gram based program that currently generates syntactic/semantic tags for the words in the corpus. The syntactic portion of the tag is mandatory and the semantic portion depends upon whether the word has any special domain-specific classification or not. Currently only nouns, gerunds, and adjectives are assigned semantic tags.</Paragraph>
    <Paragraph position="2"> For example, in the domain of veterinary medicine, &quot;dog&quot; would be assigned the tag &quot;noun--patient,&quot; &quot;nasal&quot; would be &quot;adj--body-part,&quot; etc.</Paragraph>
    <Paragraph position="3"> The parsing strategy is based on the initial identification of simple phrases, which are later converted to deeper structures with the help of separate specialist programs for coordinate conjunct identification and prepositional phrase attachment.</Paragraph>
    <Paragraph position="4"> The result for a given sentence is a single parse, many of whose elements are comparatively underspecified. For example, the parses generated lack clause boundaries. Nevertheless, the results are surprisingly useful in the extraction of relationships from the corpus.</Paragraph>
    <Paragraph position="5"> The semi-parser recognises noun-, verb-, prepositional-, gerund-, infinitive-, and adjectival-phrases. The prepositional phrase attachment specialist [2] uses case grammar analysis to disambiguate the attachments of prepositional phrases and is highly domain-dependent. The current implementation of this subcomponent is highly specific to the domain of veterinary medicine, the initial testbed for the KUDZU system. Note that all the examples presented in this paper will be taken from this domain. The coordinate conjunction specialist identifies pairs of appropriate conjuncts for the coordinate conjunctions in the text and is domain-independent in nature [1]. The restructurer puts together the information acquired by the specialist programs in order to provide a better (and deeper) structure to the parse.</Paragraph>
    <Paragraph position="6"> *This research is supported by the NSF-ARPA grant number IRI 9314963.</Paragraph>
    <Paragraph position="7"> Before being passed to the knowledge analysis portion of the system, some parses undergo manual modification, which is facilitated by an editing tool especially written for this purpose. A large percentage of the modifications can be attributed to the limitation of the conjunct identifier in recognising only pairs of conjuncts, as opposed to all conjuncts of coordinate conjunctions.</Paragraph>
    <Paragraph position="8"> The KA component receives appropriate parses from the NLP component and uses them to populate an object-oriented knowledge base [4]. The nature of the knowledge base created is dependent on a domain schema which specifies the concept hierarchy and the relationships of interest in the domain. Examples of the concept hierarchy specifications and relationships present in the domain schema are given in Figure 1. Many such relationships may be defined in the domain schema. While processing a sentence, the KA component hypothesises the types of relationships that may be present in it. However, before the actual relationship is instantiated, objects corresponding to the mandatory slots must be found, either directly or with the help of an algorithm that resolves indirect and implied references. If objects corresponding to the optional slots are found, then they are also filled in. Currently, the domain schema has to be generated manually after a careful evaluation of the domain. This is a time-consuming process that often requires the help of a domain expert. Once the schema has been specified, the rest of the KA component is domain independent [4], with the exception of a domain-specific synonym table. For each sentence that is processed by the KA component, a set of semantic relationships that were found in the sentence is produced.</Paragraph>
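    <Paragraph> The slot-filling rule described above (a relationship is instantiated only when every mandatory slot has an object, and optional slots are filled when objects happen to be available) can be sketched as follows. This is a minimal illustration under stated assumptions, not the KUDZU implementation: the names RelationshipSpec and instantiate, the dictionary layout, and the TREATS relationship are all hypothetical and are not taken from the domain schema.

```python
from dataclasses import dataclass


@dataclass
class RelationshipSpec:
    """A relationship of interest, as it might be declared in a domain schema."""
    name: str
    mandatory_slots: tuple      # semantic classes that must be filled
    optional_slots: tuple = ()  # semantic classes filled only when available


def instantiate(spec, found_objects):
    """Return a filled relationship, or None if a mandatory slot is missing.

    found_objects maps a semantic class (e.g. "DISORDER") to the object
    located for it, directly or through reference resolution.
    """
    if any(slot not in found_objects for slot in spec.mandatory_slots):
        return None
    filled = {slot: found_objects[slot] for slot in spec.mandatory_slots}
    for slot in spec.optional_slots:
        if slot in found_objects:
            filled[slot] = found_objects[slot]
    return {"relationship": spec.name, "slots": filled}


# Hypothetical relationship, invented for illustration only.
treats = RelationshipSpec(
    name="TREATS",
    mandatory_slots=("MEDICATION", "DISORDER"),
    optional_slots=("PATIENT",),
)
complete = instantiate(treats, {"MEDICATION": "penicillin", "DISORDER": "infection"})
incomplete = instantiate(treats, {"MEDICATION": "penicillin"})
```

    In this sketch the first call yields a relationship with both mandatory slots filled and no PATIENT entry, while the second returns None because no DISORDER object was found.</Paragraph>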
    <Paragraph position="9"> An interface to the KA component allows users to navigate through all instances of the different relationships that have been acquired from the corpus. Two sample sentences from the veterinary medicine domain, their parsed output and the relations extracted from them are shown in Figure 2.</Paragraph>
  </Section>
  <Section position="5" start_page="378" end_page="379" type="metho">
    <SectionTitle>
3. OUTLINE OF THE PROPOSED
APPROACH
</SectionTitle>
    <Paragraph position="0"> The automatic acquisition of semantic features of a domain is an important issue, since it assists portability by reducing the amount of human intervention. In the context of the KUDZU system in particular, it is desired that the system be moved from the domain of veterinary medicine to that of physical chemistry. As explained above, certain components of the system are domain-dependent and have to be significantly modified before the system can be used for a new domain. The current research aims to use the acquired semantic features in order to improve the portability of the KUDZU system.</Paragraph>
    <Paragraph position="1"> It is important to note that although the initial motivation for this research came from the need to move the KUDZU system to a new domain, the underlying techniques are generic and can be used in a variety of applications. The primary goal is to acquire the semantic features of the domain with minimal human intervention, and ensure that these features can be applied to different systems rather easily. In this research, two main types of semantic features are of interest: a concept hierarchy for the domain, and lexico-semantic patterns present in the domain. These patterns are similar to what are also known as &quot;selectional constraints&quot; or &quot;selectional patterns&quot; [6, 5] in systems which use them primarily to determine the correct parse from a large number of parses generated by a broad coverage grammar. They are basically co-occurrence patterns between meaningful¹ nouns, gerunds, adjectives, and verbs. For example, &quot;DISORDER of BODY-PART&quot;, &quot;MEDICATION can be used to TREAT-VERB PATIENT&quot;, etc. are legitimate lexico-semantic patterns from the veterinary medicine domain.</Paragraph>
    <Paragraph position="2"> The steps involved in the acquisition of semantic features from the domain can be briefly outlined as follows: 1. Generate the syntactic tags for all the words in the corpus. 2. Algorithmically identify the explicit semantic clusters that may exist in the current domain. Apply the clustering algorithm separately to nouns, gerunds, adjectives, and verbs.</Paragraph>
    <Paragraph position="3"> 3. Use the syntactic tags and semantic classes to automate the identification of lexico-semantic patterns that exist in the given domain.</Paragraph>
    <Paragraph position="4"> This basic methodology is similar to some other approaches adopted in the past [8, 6]. However, some important differences exist which will be discussed later.</Paragraph>
    <Paragraph position="5"> Once the semantic features have been obtained, they can be used in a variety of ways depending upon the needs of the NLP system. They can be helpful in improving the portability of an NLP system by providing useful semantic information that may be needed by different components of the system. In the KUDZU system, these features will be used to improve the success rate of a domain-independent syntactically based prepositional phrase attachment specialist, and for automatic acquisition of the domain schema.</Paragraph>
    <Paragraph position="6"> ¹As explained later, a meaningful word is one that has a semantic tag associated with it.</Paragraph>
    <Paragraph position="7"> It is easy to see how the lexico-semantic patterns can be helpful in the attachment of prepositional phrases. All patterns that have some preposition embedded within them will essentially provide selectional constraints for the classes of words that may appear in the object and host slots. These patterns can be used to improve the success rate of a domain-independent syntactically based prepositional phrase attachment specialist. There is ample evidence [10, 15] that semantic categories and collocational patterns can be used effectively to assist in the process of prepositional phrase attachment. These semantic features will also be used to automatically generate the domain schema for any given domain. Figure 1 contains examples of the semantic class hierarchy and the relationships of interest, as defined in the domain schema. The former may be acquired from the semantic clustering process. The specification of the relationships can be achieved with the help of the weighted lexico-semantic patterns. Some of the relationships can be acquired by an automated comparison of all patterns involving a given semantic verb class. Other relationships may be determined by comparing other patterns with common noun and gerund semantic classes.</Paragraph>
    <Paragraph position="8"> The resulting domain schema, in some sense, represents the semantic structure of the domain.</Paragraph>
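    <Paragraph> The use of preposition-bearing patterns as selectional constraints for prepositional phrase attachment, discussed earlier in this section, can be illustrated with a small sketch. It is only one plausible reading, not the KUDZU attachment specialist: the function name choose_host, the (host, preposition, object) triple representation, and the weights are assumptions made for the example.

```python
def choose_host(candidate_hosts, preposition, object_class, pattern_weights):
    """Pick the host class whose (host, preposition, object) pattern weighs most.

    Returns None when no known pattern supports any candidate, so that a
    purely syntactic attachment decision can be used instead.
    """
    scored = [
        (pattern_weights.get((host, preposition, object_class), 0.0), host)
        for host in candidate_hosts
    ]
    if not scored:
        return None
    best_weight, best_host = max(scored, key=lambda pair: pair[0])
    return best_host if best_weight > 0 else None


# Invented weights for two patterns from the veterinary medicine examples.
weights = {
    ("TREAT-VERB", "with", "MEDICATION"): 0.7,
    ("DISORDER", "of", "BODY-PART"): 0.3,
}
# A "with MEDICATION" phrase could attach to the verb or to a preceding noun.
host = choose_host(["TREAT-VERB", "BODY-PART"], "with", "MEDICATION", weights)
unknown = choose_host(["PATIENT"], "near", "BODY-PART", weights)
```

    Here the phrase is attached to the TREAT-VERB host because only that attachment corresponds to a known pattern; the second call returns None because no pattern covers it.</Paragraph>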
  </Section>
  <Section position="6" start_page="379" end_page="381" type="metho">
    <SectionTitle>
4. ACQUISITION OF SEMANTIC
FEATURES
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="379" end_page="379" type="sub_section">
      <SectionTitle>
4.1. Tagging
</SectionTitle>
      <Paragraph position="0"> It was decided that the tag set used by the Penn treebank project, with a few exceptions, be adopted for tagging the corpus. Unlike the Penn treebank tag set, we have separate tags for auxiliaries, gerunds, and subordinate conjunctions (rather than clumping subordinate conjunctions with prepositions). Therefore, as a first step in the process of acquisition of semantic features, the corpus is tagged with appropriate tags. Brill's rule-based tagger [3] is being used for this purpose. This step is primarily domain-independent, although the tagger may have to be further trained on a new domain.</Paragraph>
      <Paragraph position="1"> Since this tag set is purely syntactic in nature, the semantic clusters of the words must be acquired by a different method.</Paragraph>
    </Section>
    <Section position="2" start_page="379" end_page="380" type="sub_section">
      <SectionTitle>
4.2. Identification of Semantic Clusters
</SectionTitle>
      <Paragraph position="0"> The identification of such semantic clusters (which provide the concept hierarchy) is the next step. Nouns, gerunds, adjectives, and verbs are to be clustered into separate semantic hierarchies. A traditional clustering system, COBWEB/3, is used to cluster words into their semantic categories.</Paragraph>
      <Paragraph position="1"> Since COBWEB/3 [13] requires attribute-value vectors associated with the entities to be clustered, such vectors must be defined. The attributes used to define these vectors should be chosen to reflect the lexico-syntactic context of the words because the semantic category of a word is strongly influenced by the context in which it appears. The proposed methodology involves specifying a set of lexico-syntactic attributes separately for nouns, gerunds, adjectives, and verbs. Presumably, the syntactic constraints that affect the semantic category of a noun are different from those that affect the category of gerunds, adjectives and verbs. Currently, three attributes are being used for noun clustering: subj_verb (verb whose subject is the current noun), obj_verb (verb whose object is the current noun), and host_prep (preposition of which the current noun is an object). The top i values that satisfy the attribute subj_verb, top j values of obj_verb, and top k values of host_prep are of interest. A cross-product of these values yields the attribute-value vectors. For example, if i = 3, j = 3, and k = 2 are used, 3 x 3 x 2 = 18 vectors are generated for each noun. These values are generated by a program from the phrasal structures produced by the semi-parser. The same attributes can be used across different domains, and hence the attribute-value vectors needed for semantic clustering can be generated with no human intervention. Some examples of the semantic clusters that may be identified in the domain of veterinary medicine are DISORDER, PATIENT, BODY-PART, MEDICATION, etc. for nouns; DIAGNOSTIC-PROC, TREATMENT-PROC, etc.</Paragraph>
      <Paragraph position="2"> for gerunds; DISORDER, BODY-PART, etc. for adjectives; CAUSE-VERB, TREAT-VERB, etc. for verbs.</Paragraph>
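      <Paragraph> The cross-product construction of attribute-value vectors for a noun can be sketched as follows. The sketch assumes that the "top" values of an attribute are its most frequent values, and it pads with None when fewer values were observed; both choices, along with the function names and the sample observations, are assumptions made for illustration rather than details given in the text.

```python
from collections import Counter
from itertools import product


def top_values(observations, n):
    """Return the n most frequent values; pad with None if fewer were seen."""
    ranked = [value for value, _ in Counter(observations).most_common(n)]
    return ranked + [None] * (n - len(ranked))


def attribute_value_vectors(subj_verbs, obj_verbs, host_preps, i=3, j=3, k=2):
    """Cross-product of the top i, j, and k values of the three noun attributes."""
    return list(product(
        top_values(subj_verbs, i),
        top_values(obj_verbs, j),
        top_values(host_preps, k),
    ))


# Invented observations for a single noun; a real run would read them
# from the phrasal structures produced by the semi-parser.
vectors = attribute_value_vectors(
    subj_verbs=["bark", "eat", "bark", "sleep", "vomit"],
    obj_verbs=["treat", "examine", "treat", "vaccinate"],
    host_preps=["in", "of", "in", "with"],
)
print(len(vectors))  # 3 * 3 * 2 = 18
```

      Each resulting tuple is one (subj_verb, obj_verb, host_prep) vector that could be handed to the clustering system for that noun.</Paragraph>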
      <Paragraph position="3"> The clustering technique is not expected to generate completely correct clusters in one pass. However, the improperly classified words will not be manually reclassified at this stage. In order to attain proper hierarchical clusters, the process of clustering may have to be performed again after lexico-semantic patterns have been discovered by the process described below. The only human intervention required at the present stage is for the assignment of class identifiers to the generated classes. In fact, a human is shown small sub-clusters (each with 8-10 objects²) of the generated hierarchy, and is asked to label these sub-clusters with a semantic label, if possible. Note that not all such sub-clusters should have semantic labels -- several nouns in the corpus are generic nouns that cannot be classified into any semantic class. However, a majority of the sub-clusters should represent the semantic classes that exist in the domain. The class identifiers thus assigned are then associated as semantic tags with these words and used to discover the lexico-semantic patterns in the next step. Any word that has a semantic tag is considered to be meaningful.</Paragraph>
    </Section>
    <Section position="3" start_page="380" end_page="381" type="sub_section">
      <SectionTitle>
4.3. Discovery of Lexico-Semantic
Patterns
</SectionTitle>
      <Paragraph position="0"> The semantic clusters obtained from the clustering procedure, after the manual assignment of class identifiers, are used to identify the lexico-semantic patterns. The phrase-level parsed structures produced by the semi-parser are analysed for different patterns. These patterns are of the form subject-verb-object, noun-noun, adjective-noun, NP-PP, and VP-NP-PP, where NP, VP, and PP refer to noun-, verb-, and prepositional-phrases respectively.</Paragraph>
      <Paragraph position="1"> ²Note that this does not reflect the size of the final clusters generated by the program, since words that eventually should belong to the same cluster may initially be in different sub-clusters.</Paragraph>
      <Paragraph position="2"> All patterns that occur in the corpus more often than some pre-defined threshold are assumed to be important and are saved. One restriction currently being placed on these patterns is that at least two meaningful words must be present in every pattern.</Paragraph>
      <Paragraph position="3"> The patterns are weighted on the basis of their frequency of occurrence, with the more frequent patterns getting higher weights.</Paragraph>
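      <Paragraph> The filtering and weighting of patterns can be sketched as below. The text states only that patterns above a threshold are kept, that each must contain at least two meaningful words, and that more frequent patterns receive higher weights; the relative-frequency weight, the convention that upper-case elements stand for semantic classes, and the function names are assumptions of this sketch.

```python
from collections import Counter


def meaningful_count(pattern):
    """Count elements that carry a semantic tag (written here in upper case)."""
    return sum(1 for element in pattern if element.isupper())


def weighted_patterns(observed, threshold):
    """Keep patterns seen more often than threshold; weight by relative frequency."""
    counts = Counter(p for p in observed if meaningful_count(p) >= 2)
    kept = {p: c for p, c in counts.items() if c > threshold}
    total = sum(kept.values())
    return {p: c / total for p, c in kept.items()}


# Invented pattern occurrences for illustration.
observed = (
    [("DISORDER", "of", "BODY-PART")] * 3
    + [("TREAT-VERB", "DISORDER", "with", "MEDICATION")] * 2
    + [("DISORDER", "of", "the")] * 5          # only one meaningful word
    + [("MEDICATION", "for", "PATIENT")] * 1   # not above the threshold
)
weights = weighted_patterns(observed, threshold=1)
```

      With these invented counts only the first two patterns survive, and the more frequent one receives the larger weight.</Paragraph>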
      <Paragraph position="4"> It seems reasonable to assume that if lexico-semantic patterns were already known to the system, the identification of semantic categories would become easier and vice versa. In this research, we propose to first identify semantic categories and then the patterns. It has long been realized that there is an interdependence between the structure of a text and the semantic classification that is present in it. Halliday [7] stated that &quot; ... there is no question of discovering one before the other.&quot; We believe that an approximate classification can be achieved before the structures are fully identified. However, this interdependence between classification and structure will have its adverse effects on the results. It is anticipated that the results of both semantic clustering and pattern discovery will not be very accurate in the first pass. Therefore, an iterative scheme is being proposed.</Paragraph>
      <Paragraph position="5"> After semantic clustering has been performed, human intervention is needed to assign class identifiers to the generated clusters. These identifiers assist in the proper discovery of lexico-semantic patterns. The resulting set of patterns may contain some irrelevant patterns and human intervention is needed to accept/reject the automatically generated patterns. Both the accepted and rejected patterns are stored by the system so that in future iterations, the same patterns do not need human verification. As has been shown before [8, 6, 14], such patterns place constraints on the semantic classes of words appearing in particular contexts. The set of selected patterns can, therefore, be used to reanalyse the corpus in order to recognize the incorrectly clustered words in the previously generated class hierarchy and to suggest the correct class for these words. For example, if the word &quot;penicillin&quot; is incorrectly clustered as a DISORDER, an analysis of the corpus will show that it appears most frequently as a MEDICATION in patterns like &quot;TREAT-VERB DISORDER with MEDICATION&quot;, &quot;MEDICATION can be used to TREAT-VERB PATIENT&quot;, etc. and rarely as a DISORDER in the DISORDER patterns. Hence, its semantic category can be guessed to be MEDICATION. This guess is added to the list of attributes for the words, and semantic clustering is performed again. This iterative mechanism assists clustering in two ways -- firstly, the additional attribute helps convergence towards better clusters and secondly, the tentative semantic classes from the i-th iteration can be used to generate values for attributes for the (i + 1)-th iteration³, thus reducing the sparsity of data. This time better clusters should be formed, and these will again be used to recognize lexico-semantic patterns. We expect the system to converge to a stable set of clusters and patterns after a small number of iterations. A simple diagram outlining the process of
³When attempting to cluster nouns, for example, the semantic classes for gerunds, verbs, and adjectives are used.</Paragraph>
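      <Paragraph> The reclassification step in the penicillin example can be sketched as a simple vote over accepted patterns. This is an illustrative reading only: the tuple representation of patterns and contexts, the majority vote, and the name guess_class are assumptions, and the occurrence counts are invented.

```python
from collections import Counter


def guess_class(word, contexts, accepted_patterns):
    """Guess the semantic class of word from the accepted patterns it fits.

    Each context is a tuple of pattern elements in which the word under
    review appears literally. A pattern matches a context when every other
    element is identical; the class at the word's position receives a vote.
    """
    votes = Counter()
    for context in contexts:
        for pattern in accepted_patterns:
            if len(pattern) != len(context):
                continue
            slots = [p for c, p in zip(context, pattern) if c == word]
            others_match = all(c == p for c, p in zip(context, pattern) if c != word)
            if others_match and len(slots) == 1 and slots[0].isupper():
                votes[slots[0]] += 1
    return votes.most_common(1)[0][0] if votes else None


accepted = [
    ("TREAT-VERB", "DISORDER", "with", "MEDICATION"),
    ("MEDICATION", "can", "be", "used", "to", "TREAT-VERB", "PATIENT"),
    ("DISORDER", "of", "BODY-PART"),
]
# Invented corpus contexts in which the word occurs.
contexts = [
    ("TREAT-VERB", "DISORDER", "with", "penicillin"),
    ("TREAT-VERB", "DISORDER", "with", "penicillin"),
    ("penicillin", "can", "be", "used", "to", "TREAT-VERB", "PATIENT"),
    ("penicillin", "of", "BODY-PART"),
]
guess = guess_class("penicillin", contexts, accepted)
```

      In this sketch MEDICATION collects three votes against one for DISORDER, so MEDICATION would be the guess added to the word's attributes before clustering is run again.</Paragraph>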
    </Section>
  </Section>
  <Section position="7" start_page="381" end_page="381" type="metho">
    <SectionTitle>
5. COMPARISON TO OTHER
APPROACHES
</SectionTitle>
    <Paragraph position="0"> The most significant work that is similar to the proposed methodology is that conducted on the Linguistic String Project [8] and the PROTEUS project [6] at New York University. Recent efforts have been made by Grishman and Sterling [6] towards automatic acquisition of selectional constraints. The technique proposed here for the acquisition of semantic features should require only limited human intervention. Since the semi-parsed structures can directly be used to generate the attribute-value vectors needed for clustering, the only human intervention should be in assignment of class identifiers, acceptance/rejection of the discovered patterns, and reclassification of some concepts that may not get correctly classified even after the feedback from the lexico-semantic patterns. Further, their approach [8, 6] uses a large broad coverage grammar, and often several parses are produced for each sentence. The basic parsing strategy adopted in the KUDZU system starts with simple phrasal parses which can be used to acquire the semantic features of the domain. These semantic features can then be used for disambiguation of syntactic attachments and thus in providing a better and deeper structure to the parse.</Paragraph>
    <Paragraph position="1"> Sekine et al. [16] have also worked on this idea of &quot;gradual approximation&quot; using an iterative mechanism. They describe a scenario for the determination of internal structures of Japanese compound nouns. They use syntactic tuples to generate collocation statistics for words which are then used for clustering. The clusters produced by their program are much smaller in size (approximately 3 words per cluster) than the ones attempted in our research. We intend to generate much larger clusters of words that intuitively belong to the same semantic category in the given domain. The semantic categories generated by the clustering process are used for the identification of semantic relationships of interest in the domain. Most of the emphasis in the research undertaken at New York University on selectional constraints [8, 6] and that in Sekine's work has been on using the collocations for improved syntactic analysis. In addition to using them to disambiguate prepositional phrase attachments, we will also use them to generate a domain schema which is fundamental to the knowledge extraction process.</Paragraph>
    <Paragraph position="2"> Our approach of consolidating several lexico-semantic patterns into frame-like structures that represent the semantic structure of the domain is similar to one discussed by Marsh [12]. Several other efforts have been made towards using semantic features for domain-specific dictionary creation or parsing.</Paragraph>
  </Section>
</Paper>