<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1706">
  <Title>XML-Based NLP Tools for Analysing and Annotating Medical Language</Title>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Deep Grammatical Analysis
</SectionTitle>
    <Paragraph position="0"> As part of our work with OHSUMED, we have been attempting to improve the coverage of a handcrafted, linguistically motivated grammar which provides full syntactic analysis paired with logical forms. The grammar and parsing system we use is the wide-coverage grammar, morphological analyser and lexicon provided by the Alvey Natural Language Tools (ANLT) system (Carroll et al. 1991, Grover et al. 1993). Our first aim was to increase coverage up to a reasonable level so that parse ranking techniques could then be applied.</Paragraph>
    <Paragraph position="1"> The ANLT grammar is a feature-based unification grammar based on the GPSG formalism (Gazdar et al., 1985). In this framework, lexical entries carry a significant amount of information including subcategorisation information. Thus the practical parse success of the grammar is significantly dependent on the quality of the lexicon. The ANLT grammar is distributed with a large lexicon and, while this provides a core of commonly-occurring lexical entries, there remains a significant problem of inadequate lexical coverage. If we try to parse OHSUMED sentences using the ANLT lexicon and no other resources, we achieve very poor results (2% coverage) because most of the medical domain words are simply not in the lexicon and there is no 'robustness' strategy built into ANLT. Rather than pursue the labour-intensive course of augmenting the lexicon with domain-specific lexical resources, we have developed a solution which does not require that new lexicons be derived for each new domain type and which has robustness built into the strategy. Furthermore, this solution does not preclude the use of specialist lexical resources if these can be used to achieve further improvements in performance.</Paragraph>
    <Paragraph position="2"> Our approach relies on the sophisticated XML-based tokenisation and POS tagging described in the previous section and it builds on this by combining POS tag information with the existing ANLT lexical resources. We preserve POS tag information for content words (nouns, verbs, adjectives, adverbs) since this is usually reliable and informative and we dispose of POS tags for function words (complementizers, determiners, particles, conjunctions, auxiliaries, pronouns, etc.) since the ANLT hand-written entries for these are more reliable and are tuned to the needs of the grammar. Furthermore, unknown words are far more likely to be content words, so knowledge of the POS tag will most often be needed for content words.</Paragraph>
    <Paragraph position="3"> Having retained content word tags, we use them during lexical look-up in one of two ways. If the word exists in the lexicon with the same basic category as the POS tag, then the POS tag plays a 'disambiguating' role, filtering out entries for the word with different categories. If, on the other hand, the word is not in the lexicon or it is not in the lexicon with the relevant category, then a basic underspecified entry for the POS tag is used as the lexical entry for the word, thereby allowing the parse to proceed.</Paragraph>
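The two-way look-up just described can be sketched in a few lines of Python. The toy lexicon and the tag-to-category mapping below are invented for illustration; they are not the actual ANLT resources or tagset.

```python
# Hypothetical fragment of a lexicon: word -> set of basic categories.
LEXICON = {
    "monitoring": {"V"},       # present only as a verb, as in the example below
    "value": {"N", "V"},
}

# Hypothetical mapping from POS tags to basic grammatical categories.
TAG_TO_CAT = {"NN": "N", "VBD": "V", "JJ": "A"}

def lookup(word, tag):
    """Return the categories to parse `word` with, given its POS tag.

    - If the word is in the lexicon with the tag's category, the tag
      disambiguates: keep only the matching entry.
    - Otherwise fall back to an underspecified entry built from the tag,
      so parsing can proceed for unknown or incompletely listed words.
    """
    cat = TAG_TO_CAT[tag]
    entries = LEXICON.get(word, set())
    if cat in entries:
        return {cat}                      # tag filters existing entries
    return {cat + "_underspecified"}      # tag-derived fallback entry
```

Note that the fallback branch covers both the unknown-word case and the incomplete-entry case (a word listed only with a different category), which is exactly the distinction drawn for monitoring below.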
    <Paragraph position="4"> For example, if the following partially tagged sentence is input to the parser, it is successfully parsed:

We studied_VBD the value_NN of transcutaneous_JJ carbon_NN dioxide_NN monitoring_NN during transport_NN

Without the tags the parse would fail since the word transcutaneous is not in the ANLT lexicon. Furthermore, monitoring is present in the lexicon but as a verb and not as a noun. For both these words, ordinary lexical look-up fails and the entries for the tags have to be used instead. Note that the case of monitoring would be problematic for a strategy where tagging is used only in case lexical look-up fails, since here it is incomplete rather than failed.</Paragraph>
    <Paragraph position="5"> The implementation of our word tag pair look-up method is specific to the ANLT system and uses its morphological analysis component to treat tags as a novel kind of affix. Space considerations preclude discussion of this topic here but see Grover and Lascarides (2001) for further details.</Paragraph>
    <Paragraph position="6"> Another impediment to parse coverage is the prevalence of technical expressions and formulae in biomedical and other technical language. For example, the following sentence has a straightforward overall syntactic structure but the ANLT grammar does not contain specialist rules for handling expressions such as 5.0+/-0.4 grams tension and thus the parse would fail.</Paragraph>
    <Paragraph position="7"> Control tissues displayed a reproducible response to bethanechol stimulation at different calcium concentrations with an ED50 of 0.4 mM calcium and a peak response of 5.0+/-0.4 grams tension.</Paragraph>
    <Paragraph position="8"> Our response to issues like these is to place a further layer of processing in between the output of the initial tokenisation pipeline in Figure 3 and the input to the parser. Since the ANLT system is not XML-based, we already use xmlperl to convert sentences to the ANLT input format of one sentence per line with tags appended to words using an underscore. We can add a number of other processes at this point to implement a strategy of using fsgmatch grammars to package up technical expressions so as to render them innocuous to the parser. Thus all of the following 'words' have been identified using fsgmatch rules and can be passed to the parser as unanalysable units. The classification of these examples as nouns reflects a hypothesis that they can slot into the correct parse as noun phrases but there is room for experimentation since the conversion to parser input format can rewrite the tag in any way.</Paragraph>
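A minimal regex-based stand-in for one such fsgmatch rule can illustrate the packaging-up strategy: a measurement expression like 5.0+/-0.4 grams tension is collapsed into a single token tagged as a noun, so the parser treats it as an unanalysable NP. The pattern below is an illustrative assumption and is far narrower than the real fsgmatch grammars.

```python
import re

# Toy pattern for "number +/- number" optionally followed by unit words.
# This is a sketch; the actual fsgmatch rules are more general.
MEASURE = re.compile(r"\d+(?:\.\d+)?\+/-\d+(?:\.\d+)?(?: \w+)*")

def package(sentence):
    """Replace each matched expression with an underscore-joined token
    tagged _NN, reflecting the hypothesis that such units slot into the
    parse as noun phrases."""
    def repl(m):
        return m.group(0).replace(" ", "_") + "_NN"
    return MEASURE.sub(repl, sentence)
```

Because the conversion to parser input format can rewrite the tag in any way, the `_NN` suffix here is just the default hypothesis mentioned above, not a fixed choice.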
    <Paragraph position="10"> In addition to these kinds of examples, we also package up other less technical expressions such as common multi-word expressions and spelled-out numbers.</Paragraph>
    <Paragraph position="12"> In order to measure the effectiveness of our attempts to improve coverage, we conducted an experiment where we parsed 200 sentences taken at random from OHSUMED. We processed the sentences in three different ways and gathered parse success rates for each of the three methods. Version 1 established a 'no-intervention' baseline by using the initial pipeline in Figure 3 to identify words and sentences but otherwise discarding all other mark-up. Version 2 addressed the lexical robustness issue by retaining POS tags to be used by the grammar in the way outlined above. Version 3 applied the full set of preprocessing techniques including the packaging-up of formulaic and other technical expressions. The parse results for these runs are as follows:

Version 1: 4 parses (2%)
Version 2: 32 parses (16%)
Version 3: 79 parses (39.5%)

Even in Version 3, coverage is still not very high, but the difference between the three versions demonstrates that our approach has made significant inroads into the problem. Moreover, the increase in coverage was achieved without any significant alterations to the general-purpose grammar, and the tokenisation of formulaic expressions was by no means comprehensive.</Paragraph>
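The quoted percentages follow directly from the 200-sentence sample; a quick check of the arithmetic:

```python
# Parse counts from the three experimental runs over 200 sentences.
RESULTS = {"Version 1": 4, "Version 2": 32, "Version 3": 79}
TOTAL = 200

def coverage(parses, total=TOTAL):
    """Parse success rate as a percentage of the sample."""
    return 100.0 * parses / total
```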
  </Section>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Shallow Analysis
</SectionTitle>
    <Paragraph position="0"> In contrast to the full syntactic analysis experiments described in the previous section, here we describe two distinct methods of shallow analysis from which we acquire frequency information which is used to predict lexical semantic relations in a particular kind of noun compound.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 The Task
</SectionTitle>
      <Paragraph position="0"> The aim of the processing in this task is to predict the relationship between a deverbal nominalisation head and its modifier in noun-noun compounds such as tube placement, antibody response, pain response, helicopter transport. In these examples, the meaning of the head noun is closely related to the meaning of the verb from which it derives and the relationship between this noun and its modifier can typically be matched onto a relationship between the verb and one of its arguments. For example, there is a correspondence between the compound tube placement and the verb plus direct object string place the tube. When we interpret the compound we describe the role that the modifier plays in terms of the argument position it would fill in the corresponding verbal construction:

tube placement: object
antibody response: subject
pain response: to-object
helicopter transport: by-object

We can infer that tube in tube placement fills the object role in the place relation by gathering instances from the corpus of the verb place and discovering that tube occurs more frequently in object position than in other positions and that the object interpretation is therefore more probable.</Paragraph>
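The inference step amounts to an argmax over argument-slot counts. The sketch below assumes counts have already been gathered from the corpus; the numbers are invented for illustration.

```python
# Invented corpus counts: (verb, modifier-noun) -> frequency per slot.
COUNTS = {
    ("place", "tube"): {"subj": 1, "obj": 14, "to": 0, "by": 0},
}

def interpret(verb, modifier, counts=COUNTS):
    """Pick the most frequent argument slot as the compound's
    interpretation, e.g. tube placement -> object."""
    slots = counts[(verb, modifier)]
    return max(slots, key=slots.get)
```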
      <Paragraph position="1"> To interpret such compounds in this way, we need access to information about the verbs from which the head nouns are derived. Specifically, for each verb, we need counts of the frequency with which it occurs with each noun in each of its argument slots. Ultimately, in fact, in view of the sparse data problem, we need to back off from specific noun instances to noun classes (see Section 4.4). The current state-of-the-art in NLP provides a number of routes to acquiring grammatical relations information about verbs, and for our experiment we chose two methods in order to be able to compare the techniques and assess their utility.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Chunking with Cass
</SectionTitle>
      <Paragraph position="0"> Our first method of acquiring verb grammatical relations is that used by Lapata (2000) for a similar task on more general linguistic data. This method uses Abney's (1996) Cass chunker, which is based on the finite-state cascade technique. A finite-state cascade is a sequence of non-recursive levels: phrases at one level are built on phrases at the previous level without containing same level or higher-level phrases. Two levels of particular importance are chunks and simplex clauses. A chunk is the non-recursive core of intra-clausal constituents extending from the beginning of the constituent to its head, excluding post-head dependents (i.e., NP, VP, PP), whereas a simplex clause is a sequence of non-recursive clauses (Abney, 1996). Cass recognizes chunks and simplex clauses using a regular expression grammar without attempting to resolve attachment ambiguities. The parser comes with a large-scale grammar for English and a built-in tool that extracts predicate-argument tuples out of the parse trees that Cass produces. Thus the tool identifies subjects and objects as well as PPs without, however, distinguishing arguments from adjuncts. We consider verbs followed by the preposition by and a head noun as instances of verb-subject relations.</Paragraph>
      <Paragraph position="1"> Our verb-object tuples also include prepositional objects even though these are not explicitly identified by Cass. We assume that PPs adjacent to the verb and headed by either of the prepositions in, to, for, with, on, at, from, of, into, through, upon are prepositional objects.</Paragraph>
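The adjacency heuristic can be sketched as a pass over chunk sequences. The (label, head) pair representation of chunker output is a simplifying assumption, not Cass's actual output format.

```python
# Prepositions whose PPs, when adjacent to the verb, count as
# prepositional objects (the list given in the text).
PREPS = {"in", "to", "for", "with", "on", "at", "from", "of",
         "into", "through", "upon"}

def prep_objects(chunks):
    """chunks: list of (label, head) pairs, e.g. ('vp', 'respond'),
    ('pp', ('to', 'stimulation')). Returns (verb, prep, noun) tuples
    for PPs adjacent to a verb and headed by a listed preposition."""
    tuples = []
    for i, (label, head) in enumerate(chunks[:-1]):
        nxt_label, nxt_head = chunks[i + 1]
        if label == "vp" and nxt_label == "pp":
            prep, noun = nxt_head
            if prep in PREPS:
                tuples.append((head, prep, noun))
    return tuples
```

A by-PP in the same position would instead be recorded as a verb-subject relation, per the rule stated at the end of the previous paragraph; that branch is omitted here for brevity.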
      <Paragraph position="2"> The input to the process is the entire OHSUMED corpus after it has been converted to XML, tokenised, split into sentences and POS tagged using ltpos as described in Section 2. The output of this tokenisation is converted to Cass's input format, which is a non-XML file with one word per line and tags separated by a tab. We achieve this conversion using xmlperl with a simple rule file. The output of Cass and the grammatical relations processor is a list of each verb-argument pair in the corpus:

manage :obj refibrillation
respond :subj psoriasis
access :to system</Paragraph>
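The XML-to-Cass conversion is done with xmlperl in the pipeline; an equivalent sketch in Python shows the shape of the transformation. The element and attribute names (a `<w>` element carrying its tag in a `p` attribute) are assumptions about the LT XML mark-up, not its exact DTD.

```python
import xml.etree.ElementTree as ET

def to_cass_format(xml_sentence):
    """Convert an XML-tokenised sentence to Cass input: one word per
    line, word and POS tag separated by a tab."""
    root = ET.fromstring(xml_sentence)
    lines = []
    for w in root.iter("w"):
        lines.append(f"{w.text}\t{w.get('p')}")
    return "\n".join(lines)
```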
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Shallow Parsing with the Tag Sequence
Grammar
</SectionTitle>
      <Paragraph position="0"> Our second method of acquiring verb grammatical relations uses the statistical parser developed by Briscoe and Carroll (1993, 1997), which is an extension of the ANLT grammar development system which we used for our deep grammatical analysis as reported in Section 3 above. The statistical parser, known as the Tag Sequence Grammar (TSG), uses a hand-crafted grammar where the lexical entries are for POS tags rather than words themselves. Thus it is strings of tags that are parsed rather than strings of words. The statistical part of the system is the parse ranking component, where probabilities are associated with transitions in an LR parse table. The grammar does not achieve full coverage but on the OHSUMED corpus we were able to obtain parses for 99.05% of sentences. The number of parses found per sentence ranges from zero into the thousands but the system returns the highest ranked parse according to the statistical ranking method. We do not have an accurate measure of how many of the highest ranked parses are actually correct but even a partially incorrect parse may still yield useful grammatical relations data.</Paragraph>
      <Paragraph position="1"> In recent developments (Carroll and Briscoe, 2001), the TSG authors have developed an algorithm for mapping TSG parse trees to representations of grammatical relations within the sentence in the following format, shown here for the example sentence These centres are efficiently trapped in proteins at low temperatures:</Paragraph>
      <Paragraph position="3"> This format can easily be mapped to the same format as described in Section 4.2 to give counts of the number of times a particular verb occurs with a particular noun as its subject, object or prepositional object.</Paragraph>
      <Paragraph position="4"> As explained above, the TSG parses sequences of tags; however, it requires a different tagset from that produced by ltpos, namely the CLAWS2 tagset (Garside, 1987). To prepare the corpus for parsing with the TSG we therefore tagged it with Elworthy's (1994) tagger, and since this is a non-XML tool we used xmlperl to invoke it and to incorporate its results back into the XML mark-up. Sentences were then prepared as input to the TSG; this involved using xmlperl to replace words by their lemmas and to convert to ANLT input format:

These_DD2 centre_NN2 be_VBR efficiently_RR trap_VVN in_II protein_NN2 at_II low_JJ temperature_NN2

The lemmas are needed in order that the TSG outputs them rather than inflected words in the grammatical relations output shown above.</Paragraph>
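Assuming the earlier tagging and lemmatisation stages deliver (word, lemma, tag) triples, the final conversion to ANLT input format is a simple join. This sketch stands in for the xmlperl rule file mentioned above.

```python
def to_tsg_input(tokens):
    """tokens: list of (word, lemma, tag) triples. Produce one sentence
    per line with the lemma replacing the inflected word and the CLAWS2
    tag appended with an underscore."""
    return " ".join(f"{lemma}_{tag}" for _, lemma, tag in tokens)
```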
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.4 Compound Interpretation
</SectionTitle>
      <Paragraph position="0"> Having collected two different sets of frequency counts from the entire OHSUMED corpus for verbs and their arguments, we performed an experiment to discover (a) whether it is possible to reliably predict semantic relations in nominalisation-headed compounds and (b) whether the two methods of collecting frequency counts make any significant difference to the process.</Paragraph>
      <Paragraph position="1"> To collect data for the experiment we needed to add to the mark-up already created by the basic pipeline in Figure 3, (a) to mark up deverbal nominalisations with information about their verbal stem to give nominalisation-verb equivalences and (b) to mark up compounds in order to collect samples of two-word compounds headed by deverbal nominalisations. For the first task we combined further use of the lemmatiser with the use of lexical resources.</Paragraph>
      <Paragraph position="2"> In a first pass we used the morpha lemmatiser to find the verbal stem for -ing nominalisations such as screening, and then we looked up the remaining nouns in a nominalisation lexicon which we created by combining the nominalisation list provided by UMLS (2000) with the NOMLEX nominalisation lexicon (MacLeod et al., 1998). As a result of these stages, most of the deverbal nominalisations can be marked up with a VSTEM attribute whose value is the verbal stem. For the second task, we developed a grammar for compounds of all lengths and kinds and used this to process a subset of the first two years of the corpus.</Paragraph>
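The two-pass stem assignment can be sketched as follows. The suffix-stripping rule is a toy stand-in for the morpha lemmatiser, and the lexicon entries are illustrative, not drawn from UMLS or NOMLEX.

```python
# Toy nominalisation lexicon: nominalisation -> verbal stem.
NOM_LEXICON = {"placement": "place", "response": "respond",
               "transport": "transport"}

def verbal_stem(noun):
    """First pass: morphological rule for -ing nominalisations.
    Second pass: fall back to the nominalisation lexicon.
    Returns None when no verbal stem can be assigned."""
    if noun.endswith("ing"):
        return noun[:-3]              # e.g. screening -> screen
    return NOM_LEXICON.get(noun)
```

In the actual pipeline the result is recorded as the VSTEM attribute on the marked-up noun rather than returned as a value.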
      <Paragraph position="3"> We interpret nominalisations in the biomedical domain using a machine learning approach which combines syntactic, semantic, and contextual features. Using the LT XML program sggrep we extracted all sentences containing two-word compounds headed by deverbal nominalisations and from this we took a random sample of 1,000 nominalisations. These were manually disambiguated using the following categories which denote the argument relation between the deverbal head and its modifier: SUBJ (age distribution), OBJ (weight loss), WITH (graft replacement), FROM (blood elimination), AGAINST (seizure protection), FOR (nonstress test), IN (vessel obstruction), BY (aerosol administration), OF (water deprivation), ON (knee operation), and TO (treatment response). We also included the categories NA (non-applicable) for nominalisations with relations other than the ones predicted by the underlying verb's subcategorisation frame (e.g., death stroke) and NV (non-deverbal) for compounds that were wrongly identified as nominalisations. We treated the interpretation of nominalisations as a classification task and experimented with different features using the C4.5 decision tree learner (Quinlan, 1993). Some of the features we took into account were the context surrounding the candidate nominalisations (encoded as words or POS tags), the number of times a modifier was attested as an argument of the verb corresponding to the nominalised head, and the nominalisation affix of the deverbal head (e.g., -ation, -ment). In the face of sparse data, linguistic resources such as WordNet (Miller and Charles, 1991) and UMLS were used to recreate distributional evidence absent from our corpus. We obtained several different classification models as a result of using different marked-up versions of the corpus, different parsers, and different linguistic resources. Full details of the results are described in Grover et al. (2002); we only have space for a brief summary here. Our best results achieved an accuracy of 73.6% (over a baseline of 58.5%) when using the type of affixation of the deverbal head, the TSG, and WordNet for recreating missing frequencies.</Paragraph>
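A minimal sketch of the feature encoding that feeds the classifier: each candidate compound is turned into a small feature dictionary of the kind described above. The feature names and the fixed affix list are illustrative assumptions; the original work used C4.5 over richer features, including contextual ones omitted here.

```python
def features(head, modifier, verb_slot_counts,
             affixes=("ation", "ment", "ing")):
    """Encode a (head, modifier) compound as features for classification:
    the nominalisation affix of the head, and the argument slot in which
    the modifier is most often attested with the underlying verb.
    Contextual features (surrounding words/POS tags) are omitted."""
    affix = next((a for a in affixes if head.endswith(a)), "none")
    best_slot = (max(verb_slot_counts, key=verb_slot_counts.get)
                 if verb_slot_counts else "none")
    return {"affix": affix, "best_slot": best_slot}
```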
    </Section>
  </Section>
</Paper>