<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1052"> <Title>SEXTANT: EXPLORING UNEXPLORED CONTEXTS FOR SEMANTIC EXTRACTION FROM SYNTACTIC ANALYSIS</Title> <Section position="4" start_page="324" end_page="325" type="metho"> <SectionTitle> SEXTANT </SectionTitle> <Paragraph position="0"> SEXTANT can be run on any English text, without any pre-coding of domain knowledge or manual editing of the text. The input text passes through the following steps: (I) Morphological Analysis. Each word is morphologically analyzed and looked up in a 100,000-word dictionary to find its possible parts of speech. (II) Grammatical Disambiguation. A stochastic parser assigns one grammatical category to each word in the text. These first two steps use CLARIT programs (Evans et al. 1991).</Paragraph> <Paragraph position="1"> (III) Noun and Verb Phrase Splitting.</Paragraph> <Paragraph position="2"> Each sentence is divided into verb and noun phrases by a simple regular grammar. (IV) Syntagmatic Relation Extraction. A four-pass algorithm attaches modifiers to nouns, noun phrases to noun phrases, and verbs to noun phrases (Grefenstette 1992a). (V) Context Isolation. The modifying words attached to each word in the text are isolated for all nouns. Thus the context of each noun is given by all the words with which it is associated throughout the corpus. (VI) Similarity Matching. 
Contexts are compared using similarity measures developed in the Social Sciences, such as a weighted Jaccard measure.</Paragraph> <Paragraph position="3"> As an example, consider the following sentence extracted from a medical corpus.</Paragraph> <Paragraph position="4"> Cyclophosphamide markedly prolonged induction time and suppressed peak titer irrespective of the time of antigen administration.</Paragraph> <Paragraph position="5"> Each word is looked up in an online dictionary.</Paragraph> <Paragraph position="6"> After grammatical ambiguities are removed by the stochastic parser, the phrase is divided into noun phrases (NP) and verb phrases (VP). Once each sentence in the text is divided into phrases, intra- and inter-phrase structural relations are extracted. First, noun phrases are scanned from left to right (NPLR), hooking up articles, adjectives and modifier nouns to their head nouns. Then, noun phrases are scanned right to left (NPRL), connecting nouns over prepositions. Then, starting from verb phrases, phrases are scanned before the verb phrase for an unconnected head which becomes the subject (VPRL), and likewise to the right of the verb for objects (VPLR), producing for the example: of relations that are considered as each word's context for similarity calculations. For example, one set of relations extracted by SEXTANT for the above sentence can be In this example, the word time is found modified by the words induction, prolong-DOBJ and administration, while administration is only considered by this set of relations to be modified by antigen. Over the whole corpus of 160,000 words, one can consider what modifies administration. Isolating these modifiers gives a list such as At this point SEXTANT compares all the other words in the corpus, using a user-specified similarity measure such as the Jaccard measure, to find which words are most similar to which others. 
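The comparison step can be sketched roughly as follows. This is a minimal illustration, not SEXTANT's implementation: it uses the plain unweighted Jaccard measure rather than the weighted variant mentioned earlier, the function names jaccard and most_similar are hypothetical, and the toy context sets are invented for the example rather than taken from the corpus.

```python
def jaccard(a, b):
    """Unweighted Jaccard: shared attributes divided by all attributes."""
    union = a.union(b)
    if not union:
        return 0.0  # two empty contexts are treated as dissimilar
    return len(a.intersection(b)) / len(union)


def most_similar(target, contexts):
    """Rank every other noun by Jaccard similarity to target, best first."""
    scores = [
        (jaccard(contexts[target], attrs), noun)
        for noun, attrs in contexts.items()
        if noun != target
    ]
    # Descending score; ties broken alphabetically so the order is stable.
    ranked = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [noun for _, noun in ranked]


# Toy contexts, invented for illustration (not the paper's corpus data).
contexts = {
    "administration": {"antigen", "intravenous", "dose", "prior"},
    "injection": {"intravenous", "dose", "prior", "rapid"},
    "time": {"induction", "prolong-DOBJ", "administration"},
}

print(most_similar("administration", contexts))
```

Because each noun is represented only by its set of modifiers, two nouns can score as similar without ever occurring in the same document.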
For example, the words found as most similar to administration in this medical corpus were the following words in order of most to least similar: injection, treatment, therapy, infusion, dose, response, ...</Paragraph> <Paragraph position="7"> As can be seen, the sense of administration as in the &quot;administration of drugs and medicines&quot; is clearly extracted here, since administration in this corpus is used most similarly to other words such as injection and therapy, which have to do with dispensing drugs and medicines. One of the interesting aspects of this approach, contrary to the coarse-grained document co-occurrence approach, is that administration and injection need never appear in the same document for them to be recognized as semantically similar. In the case of this corpus, administration and injection were considered similar because they shared the following modifiers: acid follow-DOBJ growth prior produce-IOBJ dose extract increase-SUBJ intravenous treat-IOBJ associate-SUBJ associate-DOBJ rapid cause-SUBJ antigen adrenalectomy aortic hormone subside-IOBJ alter-IOBJ folic-acid amd folate. It is hard to select any one word which would indicate that these two words were similar, but the fact that they do share so many words, and more so than other words, indicates that these words share close semantic characteristics in this corpus.</Paragraph> <Paragraph position="8"> When the same procedure is run over a corpus of library science abstracts, administration is recognized as closest to graduate, office, campus, education, director, ...</Paragraph> <Paragraph position="9"> Similarly, circulation was found to be closest to flow in the medical corpus and to date in the library corpus. Cause was found to be closest to etiology in the medical corpus and to determinant in the library corpus. 
Frequently occurring words, possessing enough context, are generally ranked by SEXTANT with words intuitively related within the defining corpus.</Paragraph> </Section> </Paper>