<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0105">
  <Title>Probabilistic Parsing of Unrestricted English Text, With a Highly-Detailed Grammar</Title>
  <Section position="5" start_page="24" end_page="26" type="metho">
    <SectionTitle>
5. TOWARDS RADICALLY EXPANDING TRAINING-SET SIZE VIA
TREEBANK CONVERSION
5.1. Introduction
</SectionTitle>
    <Paragraph position="0"> As an additioaal means of improving the accuracy of our parser, we have been working towards effecting a dramatic increase in the size of our trai~ing treebank, via treebank conversion techniques.</Paragraph>
    <Paragraph position="1"> We employ a statistical method for converting treebank from a less-detailed formatwand we have chosen the IBM/Lancaster Treebank (Eyes and Leech, 1993; Garside and McEnery, 1993) as a first representative of such treeba~k~--to a more-detailed format, that of the ATR/Lancaster Treebank.</Paragraph>
    <Paragraph position="2"> There has been very little previous work on treebanlC/ conversion. (Hughes et al., 1995) describe an effort to b~n_d-annotate text using the tagging schemes employed in various different treebanks, as a prelirnln~ry to attempting to learn, in a way to be determined, how to convert a corpus automatically from one style of tagging markup to another. (Wang et at., 1994) take on the problem of converting treeb~n~ conforming to their English grammar into a format conforming to a later version of the same grammar, and report a conversion accuracy of some 96% on a 141,000-word test set. They employ a heuristic which scores source-treebank/target-treebank parse pairs based essentially on the percentage of identically-placed brackets in the two parses. However, their target grammar 19 generates only 17 parses on average per sentence of test data. Although they exhibit no parses with respect to their grammars, it can be assumed that they feature only rudimentary tag and non-terminal vocabularies.</Paragraph>
    <Paragraph position="3"> The problem we face in learning to convert IBM/Lancaster Treebank parses into ATl~/Lancaster Treebank parses is rather more difficult than this. For instance, as noted in 3.1, the Parse Base of the ATR English Grammar, which generates the parses of the ATl~/Lancaster Treebank, is 1.76, which means that on average, the Grammar generates about 200 parses for 10-word sentence; 2000 parses for a IS-word sentence, and 70,000 parses for a 20-word sentence. Further, far from featuring a rudimentary set of lexicat tags and non-termlnal node labels, the ATl~/Lancaster Treebauk utilizes ~gaud presumably their source grammar as well  rougbJy 3,000 lexica\] tags and about 1,100 d~erent non-terminal node labels, sdeg as mentioned in 2.1. F~re 2 shows a parse for a sample sentence, first from the IBM/Lancaster Treeb~-k, and next from the ATR/Lancaster Treebank. An impression of the di~cnlty of the treeb~nk conversion task undertaken here can be gained by closely contrasting the two parses of this Figure.</Paragraph>
    <Paragraph position="4"> 143,837 words included in the IBM/Lancaster Treeb~n~--35,575 words of Associated Press newswire and 108,262 words of Canadian Hansard le~slative proceedh~s--were treebanked with respect to the ATR English Grammar, in the exact same manner as the data in the ATl%/Lancaster Treeb~nk. We will refer to the IBM/Lancaster Treeb~-k version of this data as the parallel corpus. As a preliminary step to treeb~k conversion, we aligned the parallel and ATI% corpora. 87.3% of the parallel data--125,530 words--aligned essentially perfectly, and for the work reported here, we decided to operate only on this satisfactorily-aligned dat&amp;</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
5.2. The Treebank Conversion Problem
</SectionTitle>
      <Paragraph position="0"> Ideally, our treebank-conversion models should take full advantage of data in the full target treebank (i.e. the full ATR/Lancaster Treeb~,~k) as well as the parallel corpus. A direct model of the conditional probability of the ATI% parse given the source-treebank parse, p(AIF), uses only data in the parallel corpus. A more e~cient use of data would be to build two models: one to estimate the likelihood of an ATR parse, p(A), given raw text; the other to estimate p(FIA ). Then, 2degactrua/ly, rules names with respect to the ATR EngKsh Grammar; d. 2.1 v</Paragraph>
      <Paragraph position="2"> using Bayes' rule, one would write p(AIF ) as: p(A\[F) ~ p(F\[A)p(A) (i) The model for p(FIA ) uses only the parallel corpus, but the model for p(A) makes full use of the data in the ATR treebank.</Paragraph>
      <Paragraph position="3"> In our software environment, this approach would require constructing a feature-based grammar for the source treebanlc. A simpler, but probably adequate approach would combine the two models p(A) and p(AIF) heuristically, using p(AIF ) to rescore the N best parses found by the model p(A). The top-r~ed candidate from the rescored parses is selected as the ATR parse. This way takes advantage of both data sets, though not as etBciently as the Bayesian approach. We have chosen to explore the problem using an even simpler approach: ignoring the ATR treebank and working only within the model for p(AIF). This yields lower bounds on potential accuracy at low cost. We also considered filtering the parses considered by the ATR parser to ensure they satisfied certain constraints implied by the source-treeb~n~ parse. This proved to be impractical because the constr~;nts were not &amp;quot;hard&amp;quot;, i.e. the exact circ~,mstances in which they should be applied were di$cult to determine. Instead, we relied on the models to learn the constraints and the conditions for their application directly from the data. However, the issue of applying such constraints is specific to the two treeb~nkR being used; there may well be cases in which such constr~iuts are not hard to develop.</Paragraph>
      <Paragraph position="4"> The source-treebank-to--ATR conversion model was built using the same system described in Sections 2 and 3, the sole difference being that the question l~nguage was extended to allow for questions about the source treebank. Since the topology of the parallel tree may be very different from that of the ATR parse tree, it is not obvious what the analog of a node in the ATR tree is. We chose to use the &amp;quot;least enclosing&amp;quot; node: that is, the lowest (non-pretermiual) node in the parallel tree which spans (at least) the set of words spanned by the node in the ATR parse.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
5.3. Decision-Tree Questions Asked
</SectionTitle>
      <Paragraph position="0"> We ask all decision-tree questions in our treeb~n~-conversion models that we do normally in parsing with the ATR English Grammar. 21 We then add further questions which ask about the source--treebank parse for the sentence being processed.</Paragraph>
      <Paragraph position="1"> We use an extremely basic set of question-language functions in querying the structure of the source-treebank parse. These permit us to ask about the least-enclosing node, and about children and parents of this source-treebank-parse node, or of its children or parents, to any level of structure. What we can ask about a node in the source-treeb~-b parse is either what its non-terminal label is, or how many children it has. In addition, we are able to ask whether there is a constituent in the source--treebank parse with the identical span as a given node of an ATR parse; and if so, what its non-terminal label is, or how many children it has. Similarly, we can ask about constituents that &amp;quot;cross&amp;quot; a given node of an ATR parse. Finally. we can ask about the tag of any word in the source-treebank parse.</Paragraph>
      <Paragraph position="2"> There is much farther that we can go in exploiting the information in the source-treebank parse to aid in predicting the ATR parse. For instance, we can define and query grammatical relations such as clausal subject and main verb. We can even define and query notions like &amp;quot;headword&amp;quot; with respect to the source--treebank parse, although this would involve appreciable work. Furthermore, carrying over to the source--treebank environ_ment question types that seem helpful when asked about ATR parses will not be di$cult.</Paragraph>
      <Paragraph position="3">  actly match the single parse in the treeb~ulc, for a 6,556-word test set. &amp;quot;Treeb~ulc-conversion&amp;quot; models are trained on 1\].8,489 ~mning words of ATR/Lancaster Treeb~uk, together with aligned IBM/Lancaster Treeb~. &amp;quot;Parser&amp;quot; models are trained on 676,401 r-nn;ngwords of ATR/Lancaster Treebank alone.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="26" end_page="28" type="metho">
    <SectionTitle>
5.4. Experimental Results
</SectionTitle>
    <Paragraph position="0"> Evaluation Methodology We evaluate trsebank conversion to ATR-Treebank format in the same way as we evaluate the parser when it is trained in the normal ma-ner (cf. 4.1), except that test data consists of ATR-Treebank-format documents of which we also possess aligned source treebank (in this case: IBM/Lancaster-Treebank) versions. In the performance results cited below, however, we show exact match only with the single correct parse of the test treebank, rather than with any one of the correct parses indicated in the &amp;quot;golden standard&amp;quot; version of the test set.</Paragraph>
    <Paragraph position="1"> Experimental Results Table 5 displays exact-match parsing results for a normal 6,556--word test set 22. Crucially, the amount of tr~.~n;ng data here, 118,489 words, is only 17.5% as large as for the models of Tables 1-2. Considering the simplicity of the approach, we think these results constitute a proof of principle for the idea of treebank conversion. They indicate that we can build treebank conversion models of accuracy comparable to the current parser using much less data. Of course, the results here do not include models used in tagging. The treebank conversion models tag with an accuracy of 62.8%. A detailed examination of those models shows that the syntactic models are better than the parser's, while the semi-tic models are worse. This is to be expected, because the IBM/Lancaster ~I~eeb~k cont~in.q a great deal of relevant information about the syntax: but not so much about the semantics of the sentences they cont~i~ One idea, therefore, is to utilize large-scale treeb~uk conversion in the tagging domain to overcome the problem noted in 4.2, that even with 94% accuracy at strictly syntactic tagging (i.e. effectively, on tagging with our 440-tag syntax-only tag subset), approximately one word is syntactically mistagged every two sentences, leading to an increased error rate at exact-syntactic-match parsing. A second direction which suggests itself is to pursue our scaled-down approach to treebank conversion, but with more tr~;u;ng data than we have used so far. Third, we may decide to implement the more laborious two-model approach desribed in 5.2. 23 Overall, we expect that conversion models which take full advantage of the existing database as well as of the parallel corpus as outlined above should produce data of high enough quality to use as training data for our parser.</Paragraph>
    <Paragraph position="2"> 2~i.e. not for a &amp;quot;golden staudard&amp;quot; test set as des~ibed in 4.1, in which all parses are indicated for each test sentence 23It seems worth mentioning that future large-scale treebank-creation efforts would probably benefit from constmcting parallel data with respect to other large ~eeb~k% right from the start.</Paragraph>
  </Section>
class="xml-element"></Paper>