<?xml version="1.0" standalone="yes"?>
<Paper uid="P92-1024">
  <Title>Development and Evaluation of a Broad-Coverage Probabilistic Grammar of English-Language Computer Manuals</Title>
  <Section position="4" start_page="0" end_page="185" type="intro">
    <SectionTitle>
2. Approach
</SectionTitle>
    <Paragraph position="0"> Our approach to grammar development consists of the following four elements: * Selection of an application domain. * Development of a manually bracketed corpus (treebank) of the domain.</Paragraph>
    <Paragraph position="1"> * Creation of a grammar with broad coverage of a blind test set of treebanked text. * Statistical modeling, with the goal that the stochastic grammar assign maximum probability to the correct parse.</Paragraph>
    <Paragraph position="2"> We now discuss each of these elements in more detail. Application domain: A good first step toward our goal of covering general English would be to demonstrate that we can develop a parser with high parsing accuracy for sentences in, say, any book listed in Books In Print concerning needlework; or in any wholesale footwear catalog; or in any physics journal. The selected domain of focus should allow the acquisition of a large, naturally occurring corpus (at least a few million words). A corpus of that size permits realistic evaluation of performance and supplies enough data to characterize the domain, so that new test data does not surprise system developers with a new set of phenomena hitherto unaccounted for in the grammar.</Paragraph>
    <Paragraph position="3"> We selected the domain of computer manuals. Besides the possible practical advantages of being able to assign valid parses to the sentences in computer manuals, reasons for focusing on this domain include the very broad but not unrestricted range of sentence types and the availability of large corpora of computer manuals. We amassed a corpus of 40 million words, consisting of several hundred computer manuals. Our approach to developing a grammar for computer manuals is one of successive approximation. As a first approximation to the goal, we restrict ourselves to sentences of 7 to 17 words, drawn from a vocabulary consisting of the 3000 most frequent words (i.e. fully inflected forms, not lemmas) in a 600,000-word subsection of our corpus. Approximately 80% of the words in the 40-million-word corpus are included in the 3000-word vocabulary. We have available to us about 2 million words of sentences completely covered by the 3000-word vocabulary. A lexicon for this 3000-word vocabulary was completed in about 2 months.</Paragraph>
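    The vocabulary restriction described above can be illustrated with a minimal sketch. It assumes the corpus is already split into sentences and tokenized into surface forms; the function names (build_vocabulary, token_coverage, select_sentences) and the toy sentences are hypothetical and are not taken from the paper, which does not describe its tooling.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences, size):
    """Return the `size` most frequent surface forms (inflected words, not lemmas)."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, _ in counts.most_common(size)}


def token_coverage(tokenized_sentences, vocabulary):
    """Fraction of running words in the corpus that belong to the vocabulary."""
    total = sum(len(sentence) for sentence in tokenized_sentences)
    covered = sum(1 for sentence in tokenized_sentences for word in sentence if word in vocabulary)
    return covered / total if total else 0.0


def select_sentences(tokenized_sentences, vocabulary, min_len=7, max_len=17):
    """Keep sentences of min_len to max_len words whose every word is in the vocabulary."""
    allowed_lengths = range(min_len, max_len + 1)
    return [
        sentence
        for sentence in tokenized_sentences
        if len(sentence) in allowed_lengths and all(word in vocabulary for word in sentence)
    ]


if __name__ == "__main__":
    # Toy stand-in for the manuals corpus (hypothetical sentences).
    corpus = [
        "press the enter key to save the file now".split(),
        "the file".split(),
        "press the escape key to cancel the pending operation immediately".split(),
    ]
    vocabulary = build_vocabulary(corpus, size=8)  # the paper uses 3000
    print(token_coverage(corpus, vocabulary))
    print(select_sentences(corpus, vocabulary))
```

    In the setting described in the text, the vocabulary would be built from the 600,000-word subsection with size 3000, and the coverage and sentence selection would then be computed over the full corpus.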
    <Paragraph position="4"> Treebank: A sizeable sample of this corpus is hand-parsed (&quot;treebanked&quot;). By definition, the hand parse (&quot;treebank parse&quot;) for any given sentence is considered its &quot;correct parse&quot; and is used to judge the grammar's parse. To fulfill this role, treebank parses are constructed as &quot;skeleton parses,&quot; i.e. so that all obvious decisions are made as to part-of-speech labels, constituent boundaries and constituent labels, but no decisions are made which are problematic, controversial, or of which the treebankers are unsure. Hence the term &quot;skeleton parse&quot;: clearly not all constituents will always figure in a treebank parse, but the essential ones always will. In practice, these are quite detailed parses in most cases. The 18 constituent labels (see footnote 2) used in the Lancaster treebank are listed and defined in Table 1. A sampling of the approximately 200 part-of-speech tags used is provided in Table 2.</Paragraph>
    <Paragraph position="5"> To date, roughly 420,000 words (about 35,000 sentences) of the computer manuals material have been treebanked by a team at the University of Lancaster, England, under Professors Geoffrey Leech and Roger Garside. Figure 1 shows two sample parses selected at random from the Lancaster Treebank.</Paragraph>
    <Paragraph position="6"> The treebank is divided into a training subcorpus and a test subcorpus. The grammar developer is able to inspect the training dataset at will, but can never see the test dataset. This latter restriction is, we feel, crucial for making progress in grammar development. The purpose of a grammar is to correctly analyze previously unseen sentences. It is only by setting it to this task that its true accuracy can be ascertained. The value of a large bracketed training corpus is that it allows the grammarian to obtain quickly a very large set of sentences that [Footnote 2: Actually there are 18 x 3 = 54 labels, as each label L has variants L&amp; for a first conjunct, and L+ for second and later conjuncts, of type L: e.g. [N[N&amp; the cause N&amp;] and [N+ the appropriate action N+]N].]</Paragraph>
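    The division into an inspectable training subcorpus and a blind test subcorpus can be sketched as follows. This is only an illustration: the paper does not say how the split was made or what proportion was held out, so the function name split_treebank, the 10% test fraction, and the fixed random seed are all assumptions.

```python
import random


def split_treebank(parsed_sentences, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then hold out a blind test portion.

    Returns (training, test). The test_fraction and seed defaults are
    illustrative assumptions, not values reported in the paper.
    """
    items = list(parsed_sentences)
    random.Random(seed).shuffle(items)  # seeded, so the split is reproducible
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


if __name__ == "__main__":
    # Integers stand in for treebank parses in this toy run.
    training, test = split_treebank(range(1000))
    print(len(training), len(test))
```

    Fixing the seed keeps the held-out portion stable across runs, so the grammar developer can work freely with the training portion while the test portion stays unseen until evaluation.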
  </Section>
</Paper>