<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0708">
  <Title>MDL-based DCG Induction for NP Identification</Title>
  <Section position="3" start_page="61" end_page="62" type="intro">
    <SectionTitle>
2 Overview
</SectionTitle>
    <Paragraph position="0"> Our learner is probabilistic, and starts with a DCG-based language model M0. Parameters are initially estimated from parsed corpora, annotated in terms of the non-terminal set used by the DCG. It incrementally processes each sentence s~ in the list of sentences So... 's# ... sn. If a sentence s# cannot be generated (the grammar contained within the model lacks the required rules), we need to find a new model with a high, non-zero posterior probability given the sentences so... s# seen so far. Our (for computational reasons, necessarily suboptimal) approach selects such a model by&amp;quot; carrying out a local search over the space of models with a non-zero posterior probability, given all sentences see so far. We use a MDL-based prior to help us compute a posterior probability. Analogously to Pereira and Schabes (P+S) \[23\], we also constrain the search using parsed corpora. Unlike P+S, we not only use parsed corpora to constrain parameter estimation, we also use it to constrain model selection. We replace M0 with the newly constructed (locally) maximal a posterior model and after processing all sentences in this incremental manner, terminate with a model that generates all sentences seen in the training set.</Paragraph>
    <Paragraph position="1"> Key aspects of our approach are: Incremental learning. We only construct rules necessary to parse sentences in training set. This reduces the computational burden and enables us to learn with grammars that use large (&gt; 30) feature sets. By contrast, batch approaches that compile-out all rules that can be expressed with a fixed category set and with rules limited to some length can only deal with far smaller feature sets, thereby preventing induction of realistic grammars.</Paragraph>
    <Paragraph position="2"> Initialisation with a model containing manually written rules, with parameters estimated from parsed corpora. This alleviates some of the pitfalls of local search and, by definition, makes estimation faster (our initial model is already a reasonable estimate).</Paragraph>
    <Paragraph position="3"> Lari and Young demonstrated this point when they used an Hidden Markov Model as an approximation of a Stochastic Context Free Grammar SCFG \[19\].</Paragraph>
    <Paragraph position="4"> Note that in general any manually written grammar  will undergenerate, so there is still a need for new rules to be induced.</Paragraph>
    <Paragraph position="5"> * Ability to induce 'fair' models from raw text. We do not select models solely on the basis of likelihood (when training material is limited, such models tend to overfit); instead, we select models in terms of their MDL-based prior probability and likelihood. MDL-based estimation usually reduces overfitting (since, with limited training material, we select a model that fits th~ training material well, but not too well) \[22\]. * Learning from positive-only data. We do not require negative examples, nor do we require human intervention. This enables us to transparently use grammar learning as part of standard text parsing.</Paragraph>
    <Paragraph position="6"> * Use of parsed corpora allows us to induce models that encode semantic and pragmatic preferences. Raw text is known to underconstrain induction \[13\] and even with an MDL-based prior, when training upon raw text, we would be unlikely to estimate a model whose hidden variables (grammar rules) resulted in linguistically plausible derivations. Parsed corpora supplies some of this missing information.</Paragraph>
  </Section>
class="xml-element"></Paper>