<?xml version="1.0" standalone="yes"?>
<Paper uid="E85-1024">
  <Title>A PROBABILISTIC PARSER</Title>
  <Section position="2" start_page="0" end_page="166" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> In this paper we present an overview of one part of the work currently being carried out at the Unit for Computer Research on the English Language (UCREL) in the University of Lancaster, under SERC research grant number GR/C/47700. This work involves the automatic syntactic analysis or parsing of the LOB corpus, using the statistical or constituent-likelihood (CL) grammar ideas of Atwell (1983). The work is based on the grammatical tagging of the LOB corpus, both as providing a partially analysed text and because of the techniques used in assigning tags. We therefore begin by briefly describing this earlier project.</Paragraph>
    <Paragraph position="1"> The grammatical tagging of the LOB corpus is described in detail elsewhere (see, for example, Leech, Garside and Atwell 1983, Marshall 1983, Beale 1985), but in essence there are three stages. The first stage takes the original corpus, on which a certain amount of pre-editing (both automatic and manual) has been performed. It assigns to each word in the corpus a set of possible tags, and it is assumed that the correct tag is in this set. The set of possible tags is chosen without at this stage considering the context in which the word appears, and the choice is made by using an ordered set of decision rules, the most commonly used of which (in about 65-70% of cases) is to look the word up in a dictionary of some 7000 words.</Paragraph>
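The first stage can be pictured as an ordered cascade of decision rules, with dictionary lookup tried first. The sketch below is illustrative only: the miniature lexicon, the suffix rules, the fallback tag set and the tag names are all assumptions made for the example, not the rules or the dictionary actually used on the LOB corpus.

```python
# Hypothetical miniature lexicon; the dictionary described above held some 7000 words.
LEXICON = {
    "the": {"AT"},
    "can": {"MD", "NN", "VB"},
    "run": {"NN", "VB"},
}

# Hypothetical suffix rules, tried in order when the lexicon lookup fails.
SUFFIX_RULES = [
    ("ly", {"RB"}),
    ("ing", {"VBG", "NN"}),
]

# Hypothetical open-class fallback used when no earlier rule applies.
DEFAULT_TAGS = {"NN", "VB", "JJ"}


def candidate_tags(word):
    """Return the set of possible tags for a word, ignoring its context."""
    key = word.lower()
    # Rule 1: dictionary lookup (the most commonly used rule).
    if key in LEXICON:
        return set(LEXICON[key])
    # Rule 2: first matching suffix rule.
    for suffix, tags in SUFFIX_RULES:
        if key.endswith(suffix):
            return set(tags)
    # Rule 3: fall back to a broad set so the correct tag is likely included.
    return set(DEFAULT_TAGS)
```

Because context is ignored at this point, an ambiguous word such as "can" keeps all of its candidate tags; choosing among them is left to the later disambiguation stage.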
    <Paragraph position="2"> The third stage involves looking at those cases where the first stage has resulted in more than one tag being assigned to a word.</Paragraph>
    <Paragraph position="3"> In this case we calculate the probability of each possible sequence of ambiguous tags, and the most likely sequence is chosen as the correct one. In most cases the probability of a sequence of tags is calculated by multiplying together the pairwise probabilities of one tag following another, and these pairwise probabilities were derived from a statistical analysis of co-occurrence of tags in the tagged Brown corpus (Francis and Kucera 1964).</Paragraph>
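The disambiguation step described above can be sketched as follows. This is a minimal illustration covering only the common case of multiplying pairwise probabilities: the probability table, the floor value for unseen pairs and the tag names are assumptions invented for the example, not figures derived from the tagged Brown corpus.

```python
from itertools import product

# Hypothetical pairwise probabilities of one tag following another.
PAIR_PROB = {
    ("AT", "NN"): 0.6,
    ("AT", "VB"): 0.01,
    ("NN", "VB"): 0.2,
    ("NN", "NN"): 0.1,
    ("VB", "NN"): 0.15,
    ("VB", "VB"): 0.02,
}

# Assumed small floor for tag pairs that are missing from the table.
UNSEEN = 1e-6


def sequence_probability(tags):
    """Multiply together the pairwise probabilities along a tag sequence."""
    prob = 1.0
    for prev, nxt in zip(tags, tags[1:]):
        prob *= PAIR_PROB.get((prev, nxt), UNSEEN)
    return prob


def best_sequence(candidate_sets):
    """Enumerate every tag sequence allowed by the candidate sets and keep the likeliest."""
    sequences = product(*[sorted(tags) for tags in candidate_sets])
    return max(sequences, key=sequence_probability)
```

For the candidate sets `[{"AT"}, {"NN", "VB"}, {"NN", "VB"}]` the four possible sequences are scored and the one with the highest product is returned. Exhaustive enumeration is used here purely for clarity; it grows exponentially with the length of the ambiguous span.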
    <Paragraph position="4"> A further stage was later inserted between the two stages described above. This stage involves the ability to look for patterns of sequences of words and putative tags assigned by the first stage, and to modify the sets of tags assigned to words. This enables various problematical situations to be resolved or clarified in order to improve the disambiguating ability of the third stage.</Paragraph>
    <Paragraph position="5"> After the third stage (when the appropriate tag will have been automatically selected some 96.5% of the time), the remaining errors are removed by a manual post-editing phase.</Paragraph>
    <Paragraph position="6"> The fundamental idea on which our syntactic analysis is based, originally formulated in Atwell (1983), is that the general principles behind the tagging system could be used at the parsing level. Thus a first stage of parsing could be to look up a tag in a dictionary to derive a set of possible constituents (or &quot;hypertags&quot;) containing this tag. Similarly, in the third stage, the probability of any particular constituent being constructed out of a particular set of constituents or word-classes at the next lower level could be used to disambiguate a set of constituents posited at the first stage. To this end some 2000 sentences from the LOB corpus have been manually parsed, and the results stored as a &quot;treebank&quot; or database of information on the frequency of occurrence of possible grammatical structures. Thus, for each possible &quot;mother&quot; constituent, there will be stored a set of sequences of daughter constituents or word-classes, together with their frequencies.</Paragraph>
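One way to picture the treebank is as a table mapping each mother constituent to a frequency count of its daughter sequences. The sketch below assumes this layout, and it also assumes that a likelihood is estimated as a simple relative frequency; the constituent labels, the sample observations and the function names are hypothetical and are not taken from the manually parsed LOB sentences.

```python
from collections import Counter, defaultdict

# Hypothetical parsed material: (mother, daughter sequence) pairs.
OBSERVED = [
    ("N", ("AT", "NN")),
    ("N", ("AT", "NN")),
    ("N", ("AT", "JJ", "NN")),
    ("V", ("VB",)),
]


def build_treebank(observations):
    """Store, for each mother, the frequency of every daughter sequence seen."""
    treebank = defaultdict(Counter)
    for mother, daughters in observations:
        treebank[mother][daughters] += 1
    return treebank


def daughter_likelihood(treebank, mother, daughters):
    """Relative frequency of a daughter sequence under a mother (an assumed estimator)."""
    counts = treebank.get(mother)
    if not counts:
        return 0.0
    return counts[daughters] / sum(counts.values())
```

With the sample data, the sequence `("AT", "NN")` accounts for two of the three observations under the mother `"N"`, so it would be preferred over `("AT", "JJ", "NN")` when both are possible analyses.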
    <Paragraph position="7"> The second stage generalises to a search for particular syntactic patterns which are recognisable in context, and the resolution of which will improve the accuracy of the third stage. We develop these ideas in the remainder of the paper.</Paragraph>
  </Section>
</Paper>