<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1048">
  <Title>Prediction of Lexicalized Tree Fragments in Text</Title>
  <Section position="3" start_page="824" end_page="824" type="metho">
    <SectionTitle>
2. USING PARTIAL STRUCTURES
</SectionTitle>
    <Paragraph position="0"> The preceding section has given evidence that adding probabilities to existing grammars in several formalisms is less than optimal since significant predictive relationships are necessarily ignored. The obvious solution is to enrich the grammars to include more information. To do this, we need variable sized units in our database, with varying terms of description, including adjacency relationships and dependency relationships. That is, given the unpredictable distribution of information in text, we would like to have a more flexible approach to representing the recurrent relations in a corpus. To address this need, we have been collecting a database of partial structures extracted from the Wall Street Journal corpus, in a way designed to record recurrent information over a wide range of size and terms of the description.</Paragraph>
    <Section position="1" start_page="824" end_page="824" type="sub_section">
      <SectionTitle>
Extracting Partial Structures
</SectionTitle>
      <Paragraph position="0"> structures is built up from the words in the corpus, by successively adding larger structures, after augmenting the corpus with the analysis provided by an unsupervised parser. The larger structures found in this way are then entered into the permanent database of structures only if a relation recurs with a frequency above a given threshold. When a structure does not meet the frequency threshold, it is generalized until it does.</Paragraph>
      <Paragraph position="1"> The descriptive relationships admitted include: * basic lexical features  - spelling - part-of-speech - lemma - major category (maximal projection) * dependency relations - depends on * adjacency relations - precedes Consider an example from the following sentence from the a training corpus of 20 million words of the Wall Street Journal. (1) Reserve board rules have put banks between a  rock and a hard place The first order description of a word consists of its basic lexical features, i.e. the word spelling, its part of speech, its lemma, and its major category. Looking at the word banks, we have as description  Assuming that we require at least two instances for a partial description to be entered into the database, none of these three descriptions qualify for the database. Therefore we must abstract away, using an arbitrarily defined abstraction path. First we abstract from the spelling to the lemma. This move admits two relations (since they are now frequent enough)</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="824" end_page="824" type="metho">
    <SectionTitle>
PRUNED STRUCTURES
</SectionTitle>
    <Paragraph position="0"> (precedes (put, VB,put/V, VG) (,NN,bank/N,NG)) (depends (put,VB,put/V, VG) (,NN,bank/N,NG)) units are selected in using language depends on a variety of factors, including meaning, subject matter, speaking situation, style, interlocutor and so on. Of course, demonstrating that this intuition is valid remains for future work.</Paragraph>
    <Paragraph position="1"> The set of partial trees can be used directly in an analogical parser, as described in Hindle 1992. In the parser, we are not concerned with estimating probabilities, but rather with finding the structure which best matches the current parser state, where a match is better the more specific its description is.</Paragraph>
    <Paragraph position="2"> The third relation is still too infrequent, so we further generalize to (precedes (,NN,,NG) (between,IN,between/I,PG)) a relation which is amply represented (3802 occurrences). The process is iterated, using the current abstracted description of each word, adding a level of description, then generalizing when below the frequency threshold. Since each level in elaborating the description adds information to each word, it can only reduce the counts, but never increase them. This process finds a number of recurrent partial structures, including between a rock and a hard place (3 occurrences in 20 million words), and \[vpput \[NP distance\] \[pp between\]\] (4 occurrences).</Paragraph>
    <Paragraph position="3"> General Caveats There is of course considerable noise introduced by the errors in analysis that the parser makes. There are several arbitrary decisions made in collecting the database. The level of the threshold is arbitrarily set at 3 for all structures. The sequence of generalization is arbitrarily determined before the training. And the predicates in the description are arbitrarily selected. We would like to have better motivation for all these decisions.</Paragraph>
    <Paragraph position="4"> It should be emphasized that while the set of descriptive terms used in the collection of the partial structure database allows a more flexible description of the corpus than simple ngrams, CFG's or some dependency descriptions, it nevertheless is also restrictive. There are many predictive relationships that can not be described. For example, parallelism, reference, topic-based or speaker-based variation, and so on.</Paragraph>
    <Paragraph position="5"> Motivation The underlying reason for developing a database of partial trees is not primarily for the language modeling task of predicting the next word. Rather the partial-tree database is motivated by the intuition that partial trees are are the locus of other sorts of linguistic information, for example, semantic or usage information. Our use of language seems to involve the composition of variably sized partially described units expressed in terms of a variety of predicates (only some of which are included in our database). Which</Paragraph>
  </Section>
  <Section position="5" start_page="824" end_page="824" type="metho">
    <SectionTitle>
3. ENHANCING A TRIGRAM MODEL
</SectionTitle>
    <Paragraph position="0"> The partial structure database provides more information than an ngram description, and thus can be used to enhance an ngram model. To explore how to use the best available information in a language model, we turn to a trigram model of Wall Street Journal text. The problem is put into relief by focusing on those cases where the trigram model fails, that is, where the observed trigram condition (w-2, w_l) does not occur in the training corpus.</Paragraph>
    <Paragraph position="1"> In the current test, we randomly assigned each sentence from a 2 million word sample of WSJ text to either the test or training set. This unrealistically minimizes the rate of unseen conditions, since typically the training and test are selected from disjoint documents (see Church and Gale 1991). On the other hand, since the training is only a million words, the trigrams are undertrained. In general, the rate of unseen conditions will vary with the domain to be modeled and the size of training corpus, but it will not (in realistic languages) be eliminated. In this test, 26% (258665/997811) of the bigrams did not appear in the test, and thus it is necessary to backoff from the trigram model.</Paragraph>
    <Paragraph position="2"> We will assume that a trigram model is sufficiently effective at preediction in those cases where the conditioning bigram has been observed in training, and will focus on the problem of what to do when the conditioning bigram has not appeared in the training. In a standard backoff model, we would look to estimate Pr(wolw_l). Here we want to consider a second predictor derived from our database of partial structures. The particular predictor we use is the lemma of the word that w-1 depends on, which we will call G(W_l). In the example discussed above, the first (standard) predictor for the word between is the preceding word banks and the second predictor for the word between is G(banks), which in this case is put/v. We want to choose among two predictors, w-i and G(w_l).</Paragraph>
    <Paragraph position="3"> In general, if we have two conditions, Ca and CCb and we want to find the probability of the next word given these conditions.</Paragraph>
    <Paragraph position="4"> Intuitively, we would like to choose the predictor C'i for which the predicted distribution of w differs most from the unigram distribution. Various measures are possible; here we con- null sider one, which Resnik (1993) calls selectional preference, namely the relative entropy between the posterior distribution Pr(w\]C) and the prior distribution Pr(w). We'll label this measure IS, where</Paragraph>
    <Paragraph position="6"> In the course of processing sentence (1), we need an estimate of Pr(between\]put banks). Our training corpus does not include the collocation put banks, so no help is available from trigrams, therefore we backoff to a bigram model, choosing the bigram predictor with maximum IS. The maximum IS is for put/V (G(w_x)) rather than for w-1 (banks) itself, so G(w_l) is used as predictor, giving a logprob estimate of -10.2 rather than -13.1.</Paragraph>
    <Paragraph position="7"> The choice of G(w_ 1) as predictor here seems to make sense, since we are willing to believe that there is a complementation relation between put/V and its second complement between.</Paragraph>
    <Paragraph position="8"> Of course, the choice is not always so intuitively appealing.</Paragraph>
    <Paragraph position="9"> When we go on to predict the next word, we need an estimate of Pr(albanks between). Again, our training corpus does not include the collocation banks between, so no help is available from trigrams. In this case, the maximum IS is for banks rather than between, so we use banks to predict a rather than between, giving a logprob estimate of-5.6 rather than -7.10.</Paragraph>
    <Paragraph position="10"> Overall, however, the two predictors can be combined to improve the language model, by always choosing the predictor with higher IS score.</Paragraph>
    <Paragraph position="11"> As shown in Table 4, this slightly improves the logprob for our test set over either predictor independently. However, Table 4 also shows that a simple strategy of chosing the raw bigram first and the G(w_l) bigram when there is no information available is slightly better. In a more general situation, where we have a set of different descriptions of the same condition, the IS score provides a way to choose the best predictor.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML