<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1034"> <Title>XML-Based Data Preparation for Robust Deep Parsing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Improving the Lexical Component </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Strategy </SectionTitle> <Paragraph position="0"> The ANLT grammar is a unification grammar based on the GPSG formalism (Gazdar et al., 1985), which is a precursor of more recent 'lexicalist' grammar formalisms such as HPSG (Pollard and Sag, 1994). In these frameworks lexical entries carry a significant amount of information including subcategorisation information. Thus the practical parse success of a grammar is significantly dependent on the quality of the lexicon. The ANLT grammar is distributed with a large lexicon which was derived semi-automatically from a machine-readable dictionary (Carroll and Grover, 1988). This lexicon is of varying quality: function words such as complementizers, prepositions, determiners and quantifiers are all reliably hand-coded but content words are less reliable. Verbs are generally coded to a high standard but the noun and adjective lexicons are full of redundancies and duplications. Since these duplications can lead to huge increases in the number of spurious parses, an obvious first step was to remove all duplications from the existing lexicons and to collapse certain ambiguities such as the count/mass distinction into single underspecified entries. A second critical step was to increase the character set that the spelling rules in the morphological analyser handle, so as to accept capitalised and non-alphabetic characters in the input. 
Once these ANLT-internal problems are overcome, the main problem of inadequate lexical coverage still remains: if we try to parse OHSUMED sentences using the ANLT lexicon and no other resources, we achieve very poor results because most of the medical domain words are simply not in the lexicon and there is no 'robustness' strategy built into ANLT. One solution to this problem would be to find domain specific lexical resources from elsewhere and to merge the new resources with the existing lexicon. However, the resulting merged lexicon may still not have sufficient coverage and a means of achieving robustness in the face of unknown words would still be required. Furthermore, every move to a new domain would depend on domain-specific lexical resources being available. Because of these disadvantages, we have pursued an alternative solution which allows parsing to proceed without the need for extra lexical resources and with robustness built into the strategy. This alternative strategy does not preclude the use of domain specific lexical resources but it does provide a basic level of performance which further resources can be used to improve upon.</Paragraph> <Paragraph position="1"> The strategy we have adopted relies first on sophisticated XML-based tokenisation (see Section 3) and second on the combination of POS tag information with the existing ANLT lexical resources. Our view is that POS tag information for content words (nouns, verbs, adjectives, adverbs) is usually reliable and informative, while tagging of function words (complementizers, determiners, particles, conjunctions, auxiliaries, pronouns, etc.) can be erratic and provides less information than the hand-written entries for function words that are typically developed side-by-side with wide coverage grammars. Furthermore, unknown words are far more likely to be content words than function words, so knowledge of the POS tag will most often be needed for content words. 
Our idea, then, is to tag the input but to retain only the content word POS tags and use them during lexical look-up in one of two ways.</Paragraph> <Paragraph position="2"> If the word exists in the lexicon then the POS tag is used to access only those entries of the same basic category. If, on the other hand, the word is not in the lexicon then a basic underspecified entry for the POS tag is used as the lexical entry for the word. In the first case, the POS tag is used as a filter, accessing only entries of the appropriate category and cutting down on the parser's search space. In the second case, the basic category of the unknown word is supplied and this enables parsing to proceed. For example, if the following partially tagged sentence is input to the parser, it is successfully parsed.[2]
We have developed VBN a variable JJ suction NN system NN for irrigation NN , aspiration NN and vitrectomy NN
Without the tags there would be no parse since the words irrigation and vitrectomy are not in the ANLT lexicon. Furthermore, tagging variable as an adjective ensures that the noun entry for variable is not accessed, thus cutting down on parse numbers (3 versus 6 in this case).</Paragraph> <Paragraph position="3"> The two cases interact where a lexical entry is present in the ANLT lexicon but not with the relevant category. For example, monitoring is present in the ANLT lexicon as a verb but not as a noun:
We studied VBD the value NN of transcutaneous JJ carbon NN dioxide NN monitoring NN during transport NN
Look-up of the word tag pair monitoring NN fails and the basic entry for the tag NN is used instead. Without the tag, the verb entry for monitoring would be accessed and the parse would fail. 
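The two uses of the tag described above, as a filter and as a fallback, can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the toy lexicon, the tag-to-category mapping and the function names are hypothetical and are not part of ANLT.

```python
# Hypothetical toy lexicon: word -> list of (category, entry) pairs.
# The entries and names below are illustrative assumptions, not ANLT data.
LEXICON = {
    "variable": [("N", "variable/noun"), ("A", "variable/adj")],
    "monitoring": [("V", "monitor+ing/verb")],
    "system": [("N", "system/noun")],
}

# Assumed mapping from Penn Treebank tag prefixes to basic categories.
TAG_TO_CATEGORY = {"NN": "N", "VB": "V", "JJ": "A", "RB": "Adv"}


def basic_category(tag):
    """Map a content-word tag such as NNS or VBD to its basic category."""
    return TAG_TO_CATEGORY.get(tag[:2])


def look_up(word, tag=None):
    """Return candidate lexical entries for a word and an optional POS tag."""
    entries = LEXICON.get(word, [])
    if tag is None:
        # Untagged words rely on the lexicon alone.
        return [entry for _, entry in entries]
    category = basic_category(tag)
    matching = [entry for cat, entry in entries if cat == category]
    if matching:
        # First case: the tag filters the entries of a known word.
        return matching
    # Second case: the word is unknown, or known only with another
    # category, so an underspecified entry for the tag is used instead.
    return ["underspecified/" + tag]
```

Under these assumptions, look_up("variable", "JJ") returns only the adjective entry, while look_up("monitoring", "NN") falls back on the underspecified NN entry because the toy lexicon lists monitoring only as a verb.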
In the following example the adjectives diminished and stabilized exist only as verb entries: with the JJ tag the parse succeeds but without it, the verb entries are accessed and the parse fails.</Paragraph> <Paragraph position="4"> There was radiographic JJ evidence NN of [...]
[2] [...] (Marcus et al., 1994): JJ labels adjectives, NN labels nouns and VB labels verbs.</Paragraph> <Paragraph position="5"> Note that cases such as these would be problematic for a strategy where tagging was used only when lexical look-up failed, since here lexical look-up doesn't fail, it just provides an incomplete set of entries. It is of course possible to augment the grammar and/or lexicon with rules to infer noun entries from verb+ing entries and adjective entries from verb+ed entries. However, this will increase lexical ambiguity quite considerably and lead to higher numbers of spurious parses.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Implementation </SectionTitle> <Paragraph position="0"> We expect the technique outlined above to be applicable across a range of parsing systems. In this section we describe how we have implemented it within ANLT.</Paragraph> <Paragraph position="1"> The version of the ANLT system described in Carroll et al. (1991) and Grover et al. (1993) does not allow tagged input but work by Briscoe and Carroll (1993) on statistical parsing uses an adapted version of the system which is able to process tagged input, ignoring the words in order to parse sequences of tags. 
We use this version of the system, running in a mode where 'words' are looked up according to three distinct cases:
• word look-up: the word has no tag and must be looked up in the lexicon (and if look-up fails, the parse fails)
• tag look-up: the word has a tag, look-up of the word tag pair fails, but the tag has a special hand-written entry which is used instead
• word tag look-up: the word has a tag and look-up of the word tag pair succeeds.</Paragraph> <Paragraph position="2"> The resources provided by the system already adequately deal with the first two cases but the third case had to be implemented. The existing morphological analysis software was relatively easily adapted to give the performance we required. The ANLT morphological analyser performs regular inflectional morphology using a unification grammar for combining morphemes and rules governing spelling changes when morphemes are concatenated. Thus a plural noun such as patients is composed of the morphemes patient and +s with the features on the top node being inherited partially from the noun and partially from the inflectional affix:</Paragraph> <Paragraph position="4"> In dealing with word tag pairs, we have used the word grammar to treat the tag as a novel kind of affix which constrains the category of the lexical entry it attaches to. We have defined morpheme entries for content word tags so they can be used by special word grammar rules and attached to words of the appropriate category. Thus patient NN is analysed using the noun entry for patient but not the adjective entry. Tag morphemes can be attached to inflected as well as to base forms, so the string patients NNS has the following internal structure:</Paragraph> <Paragraph position="6"> In defining the rules for word tag pairs, we were careful to ensure that the resulting category would have exactly the same feature specification as the word itself. 
Thus the tag morpheme is specified only for basic category features which the word grammar requires to be shared by word and tag. All other feature specifications on the covering node are inherited from the word, not the tag. This method of combining POS tag information with lexical entries preserves all information in the lexical entries, including inflectional and subcategorisation information. The preservation of subcategorisation information is particularly necessary since the ANLT lexicon makes sophisticated distinctions between different subcategorisation frames which are critical for obtaining the correct parse and associated logical form.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 XML Tools for Pre-Processing </SectionTitle> <Paragraph position="0"> The techniques described in this section, and those in the previous section, are made possible by our use of an XML processing paradigm throughout. We use the LT TTT and LT XML tools in pipelines where they add, modify or remove pieces of XML mark-up. Different combinations of the tools can be used for different processing tasks. Some of the XML programs are rule-based while others use maximum entropy modelling.</Paragraph> <Paragraph position="1"> We have developed a pipeline which converts OHSUMED data into XML format and adds linguistic annotations. The early stages of the pipeline segment character strings first into words and then into sentences while subsequent stages perform POS tagging and lemmatisation. A sample part of the output of this basic pipeline is shown in Figure 1. The initial conversion to XML and the identification of words is achieved using the core LT TTT program fsgmatch, a general purpose transducer which processes an input stream and rewrites it using rules provided in a grammar file. 
The identification of sentence boundaries, mark-up of sentence elements and POS tagging are done by the statistical program ltpos (Mikheev, 1997). Words are marked up as W elements with further information encoded as values of attributes on the W elements. In the example, the P attribute's value is a POS tag and the LM attribute's value is a lemma (only on nouns and verbs). The lemmatisation is performed by Minnen et al.'s (2000) morpha program, which is not an XML processor. In such cases we pass data out of the pipeline in the format required by the tool and merge its output back into the XML mark-up.</Paragraph> <Paragraph position="2"> Typically we use McKelvie's (1999) xmlperl program to convert out of and back into XML: for ANLT this involves putting each sentence on one line, converting some W elements into word tag pairs and stripping out all other XML mark-up to provide input to the parser in the form it requires.</Paragraph> <Paragraph position="3"> We are currently experimenting with bringing the labelled bracketing of the parse result back into the XML as 'stand-off' mark-up.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Pre-Processing for Parsing </SectionTitle> <Paragraph position="0"> In Section 2 we showed how POS tag mark-up could be used to add to existing lexical resources. In this section we demonstrate how the XML approach allows for flexibility in the way data is converted from marked-up corpus material to parser input. This method enables 'messy' linguistic data to be rendered innocuous prior to parsing, thereby avoiding the need to make hand-written low-level additions to the grammar itself. One of the failings of the ANLT lexicon is in the subcategorisation of nouns: each noun has a zero subcategorisation entry but many nouns which optionally subcategorise a complement lack the appropriate entry. 
For example, the nouns use and management do not have entries with an of-PP subcategorisation frame so that in contexts where an of-PP is present, the correct parse will not be found. The case of of-PPs is a special one since we can assume that whenever of follows a noun it marks that noun's complement. We can encode this assumption in the layer of processing that converts the XML mark-up to the format required by the parser: an fsgmatch rule changes the value of the P attribute of a noun from NN to NNOF or from NNS to NNSOF whenever it is followed by of. By not adding morpheme entries for NNOF and NNSOF we ensure that word tag look-up will fail and the system will fall back on tag look-up using special entries for NNOF and NNSOF which have only an of-PP subcategorisation frame. In this way the parser will be forced to attach of-PPs following nouns as their complements.</Paragraph> <Paragraph position="1"> 3.1.2 Numbers, formulae, etc.</Paragraph> <Paragraph position="2"> Although we have stated that we only retain content word tags, in practice we also retain certain other tags for which we provide no morpheme entry in the morphological system so as to achieve tag rather than word tag look-up. For example, we retain the CD tag assigned to numerals and provide a general purpose entry for it so that sentences containing numerals can be parsed without needing lexical entries for them. 
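The conversion steps described so far in Section 3.1, retagging nouns that precede of and retaining only content word and CD tags, can be sketched in Python as follows. The paper implements the retagging with an fsgmatch rule; the Python below is only an illustrative stand-in, and the function names, the list of retained tag prefixes and the separator between word and tag are assumptions.

```python
# Tags retained in parser input: content-word tags plus CD. This is an
# assumed, simplified list based on Penn Treebank tag prefixes.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")
EXTRA_TAGS = {"CD"}


def retag_nouns_before_of(tokens):
    """Rewrite NN to NNOF and NNS to NNSOF when the next word is 'of'."""
    # Pair every token with the word that follows it (None at the end).
    next_words = [word.lower() for word, _ in tokens[1:]] + [None]
    result = []
    for (word, tag), next_word in zip(tokens, next_words):
        if tag in ("NN", "NNS") and next_word == "of":
            tag = tag + "OF"
        result.append((word, tag))
    return result


def to_parser_input(tokens, separator=" "):
    """Keep tags for content words and numerals; drop all other tags.

    The separator between word and tag is a hypothetical parameter.
    """
    parts = []
    for word, tag in retag_nouns_before_of(tokens):
        if tag.startswith(CONTENT_PREFIXES) or tag in EXTRA_TAGS:
            parts.append(word + separator + tag)
        else:
            parts.append(word)
    return " ".join(parts)
```

Because NNOF and NNSOF would have no morpheme entries, word tag look-up for them fails by design and the special tag entries with an of-PP frame are used.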
We also use a pre-existing tokenisation component which recognises spelled out numbers to which the CD tag is also assigned: The program fsgmatch can be used to group words together into larger units using hand-written rules and small lexicons of 'multi-word' words.</Paragraph> <Paragraph position="3"> For the purposes of parsing, these larger units can be treated as words, so the grammar does not need to contain special rules for 'multi-word' words: The same technique can be used to package up a wide variety of formulaic expressions which would cause severe problems to most hand-crafted grammars. Thus all of the following 'words' have been identified using fsgmatch rules and can be passed to the parser as unanalysable chunks.[3] The classification of the examples below as nouns reflects a working hypothesis that they can slot into the correct parse as noun phrases but there is room for experimentation since the conversion to parser input format can rewrite the tag in any way. It may turn out that they should be given a more general tag which corresponds to several major category types.</Paragraph> <Paragraph position="4"> It is important to note that our method of dividing the labour between pre-processing and parsing allows for experimentation to get the best possible balance. We are still developing our formula recognition subcomponent which has so far been entirely hand-coded using fsgmatch rules.</Paragraph> <Paragraph position="5"> We believe that it is more appropriate to do this hand-coding at the pre-processing stage rather than with the relatively unwieldy formalism of the ANLT grammar. Moreover, use of the XML paradigm might allow us to build a component that can induce rules for regular formulaic expressions, thus reducing the need for hand-coding.
3.1.3 Dealing with tagger errors
The tagger we use, ltpos, has a reported performance comparable to other state-of-the-art taggers. 
However, all taggers make errors, especially when used on data different from their training data. With the strategy outlined in this paper, where we only retain a subset of tags, many tagging errors will be harmless. However, content word tagging errors will be detrimental since the basic noun/verb/adjective/adverb distinction drives lexical look-up and only entries of the same category as the tag will be accessed. If we find that the tagger consistently makes the same error in a particular context, for example mistagging +ing nominalisations as verbs (VBG), then we can use fsgmatch rules to replace the tag in just those contexts. The new tag can be given a definition which is ambiguous between NN and VBG, thereby ensuring that a parse can be achieved.</Paragraph> <Paragraph position="6"> [3] Futrelle et al. (1991) discuss tokenisation issues in biological texts.</Paragraph> <Paragraph position="7"> A second strategy that we are exploring involves using more than one tagger. Our current pipeline includes a call to Elworthy's (1994) CLAWS2 tagger. We encode the tags from this tagger as values of the attribute C2 on words: Many mistaggings can be found by searching for words where the two taggers disagree and they can be corrected in the mapping from XML format to parser input by assigning a new tag which is ambiguous between the two possibilities. For example, ltpos incorrectly tags the word bound in the following example as a noun but the CLAWS2 tagger correctly categorises it as a verb.</Paragraph> <Paragraph position="8"> a large JJ body NNOF of hemoglobin NN bound NNVVN to the ghost NN membrane NN
We use xmlperl rules to map from XML to ANLT input and reassign these cases to the 'composite' tag NNVVN, which is given both a noun and a verb entry. This allows the correct parse to be found whichever tagger is correct. 
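The composite-tag idea can be sketched in Python as follows. The paper performs this mapping with xmlperl rules; the code below is only an illustrative stand-in. The function name and the table of disagreement pairs are assumptions, and the NN/VVN pair is inferred from the composite tag name NNVVN rather than stated in the text.

```python
# Hypothetical table mapping (ltpos tag, CLAWS2 tag) pairs that signal a
# noun/verb disagreement to a composite tag. Only the tag name NNVVN
# comes from the text; the table itself is an assumption.
COMPOSITE_TAGS = {("NN", "VVN"): "NNVVN"}


def resolve_tag(ltpos_tag, claws2_tag):
    """Return the tag passed to the parser given the two taggers' outputs."""
    composite = COMPOSITE_TAGS.get((ltpos_tag, claws2_tag))
    if composite is not None:
        # Known disagreement: defer the choice to the parser by using a
        # tag whose entry is ambiguous between the two categories.
        return composite
    # Otherwise keep the ltpos tag unchanged.
    return ltpos_tag
```

A table of known pairs is used here instead of a plain inequality test because the two taggers use different tagsets, so their tag names can differ even when they agree on the category.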
An alternative approach to the mistagging problem would be to use just one tagger which returns multiple tags and to use the relative probability of the tags to determine cases where a composite tag could be created in the mapping to parser input. Charniak et al. (forthcoming) reject a multiple tag approach when using a probabilistic context-free-grammar parser, but it is unclear whether their result is relevant to a hand-crafted grammar.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 An XML corpus </SectionTitle> <Paragraph position="0"> There are numerous advantages to working with XML tools. One general advantage is that we can add linguistic annotations in an entirely automatic and incremental fashion, so as to produce a heavily annotated corpus which may well prove useful to a number of researchers for a number of linguistic activities. In the work described here we have not used any domain-specific information.</Paragraph> <Paragraph position="1"> However, it would clearly be possible to add domain-specific information as further annotations using such resources as UMLS (UMLS, 2000). Indeed, we have begun to utilise UMLS and hope to improve the accuracy of the existing mark-up by incorporating lexical and semantic information.</Paragraph> <Paragraph position="2"> Since the annotations we describe are computed entirely automatically, it would be a simple matter to use our system to mark up new Medline data to increase the size of our corpus considerably.</Paragraph> <Paragraph position="3"> A heavily annotated corpus quickly becomes unreadable but if it is an XML annotated corpus then there are several tools to help visualise the data.</Paragraph> <Paragraph position="4"> For example, we use xmlperl to convert from XML to HTML to view the corpus in a browser.</Paragraph> </Section> </Section> </Paper>