<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1034">
  <Title>XML-Based Data Preparation for Robust Deep Parsing</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The field of parsing technology currently has two distinct strands of research with few points of contact between them. On the one hand, there is thriving research on shallow parsing, chunking and induction of statistical syntactic analysers from treebanks; and on the other hand, there are systems which use hand-crafted grammars which provide both syntactic and semantic coverage.</Paragraph>
    <Paragraph position="1"> 'Shallow' approaches have good coverage on corpus data, but extensions to semantic analysis are still in a relative infancy. The 'deep' strand of research has two main problems: inadequate coverage, and a lack of reliable techniques to select the correct parse. In this paper we describe on-going research which uses hybrid technologies to address the problem of inadequate coverage of a 'deep' parsing system. In Section 2 we describe how we have modified an existing hand-crafted grammar's look-up procedure to utilise part-of-speech (POS) tag information, thereby ameliorating the lexical information shortfall. In Section 3 we describe how we combine a variety of existing NLP tools to pre-process real data up to the point where a hand-crafted grammar can start to be useful. The work described in both sections is enabled by the use of an XML processing paradigm whereby the corpus is converted to XML with analysis results encoded as XML annotations. In Section 4 we report on an experiment with a random sample of 200 sentences which gives an approximate measure of the increase in performance we have gained.</Paragraph>
    <Paragraph position="2"> The work we describe here is part of a project which aims to combine statistical and symbolic processing techniques to compute lexical semantic relationships, e.g. the semantic relations between nouns in complex nominals. We have chosen the medical domain because the field of medical informatics provides a relative abundance of pre-existing knowledge bases and ontologies.</Paragraph>
    <Paragraph position="3"> Our efforts so far have focused on the OHSUMED corpus (Hersh et al., 1994) which is a collection of Medline abstracts of medical journal papers.1 While the focus of the project is on semantic issues, a prerequisite is a large, reliably annotated corpus and a level of syntactic process1Sager et al. (1994) describe the Linguistic String Project's approach to parsing medical texts.</Paragraph>
    <Paragraph position="4"> ing that supports the computation of semantics.</Paragraph>
    <Paragraph position="5"> The computation of 'grammatical relations' from shallow parsers or chunkers is still at an early stage (Buchholz et al., 1999, Carroll et al., 1998) and there are few other robust semantic processors, and none in the medical domain. We have therefore chosen to re-use an existing hand-crafted grammar which produces compositionally derived underspecified logical forms, namely the wide-coverage grammar, morphological analyser and lexicon provided by the Alvey Natural Language Tools (ANLT) system (Carroll et al. 1991, Grover et al. 1993). Our immediate aim is to increase coverage up to a reasonable level and thereafter to experiment with ranking the parses, e.g. using Briscoe and Carroll's (1993) probabilistic extension of the ANLT software.</Paragraph>
    <Paragraph position="6"> We use XML as the preprocessing mark-up technology, specifically the LT TTT and LT XML tools (Grover et al., 2000; Thompson et al., 1997).</Paragraph>
    <Paragraph position="7"> In the initial stages of the project we converted the OHSUMED corpus into XML annotated format with mark-up that encodes word tokens, POS tags, lemmatisation information etc. The research reported here builds on that mark-up in a further stage of pre-processing prior to parsing. The XML paradigm has proved invaluable throughout.</Paragraph>
  </Section>
class="xml-element"></Paper>