<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1306">
  <Title>Sample Selection for Statistical Grammar Induction</Title>
  <Section position="4" start_page="46" end_page="46" type="metho">
    <SectionTitle>
3 Grammar Induction
</SectionTitle>
    <Paragraph position="0"> The degree of difficulty of the task of learning a grammar from data depends on the quantity and quality of the training supervision. When the training corpus consists of a larg e reservoir of fully annotated parse trees, it is possible to directly extract a grammar based on these parse trees. The success of recent high-quality parsers (Charniak, 1997; Collins, 1997) relies on the availability of such treebank corpora.</Paragraph>
    <Paragraph position="1"> To work with smaller training corpora, the learning system would require even more information about the examples than their syntactic parse trees. For instance, Hermjakob and Mooney (1997) have described a learning system that can build a deterministic shift-reduce parser from a small set of training examples with the aid of detailed morphological, syntactical, and semantic knowledge databases and step-by-step guidance from human experts.</Paragraph>
    <Paragraph position="2"> The induction task becomes more challenging as the amount of supervision in the training data and background knowledge decreases. To compensate for the missing information, the learning process requires heuristic search to find locally optimal grammars. One form of partially supervised data might specify the phrasal boundaries without specifying their labels by bracketing each constituent unit with a pair of parentheses (McNaughton, 1967). For example, the parse tree for the sentence '~Several fund managers expect a rough market this morning before prices stablize.&amp;quot; is labeled as &amp;quot;((Several fund managers) (expect ((a rough market) (this morning)) (before (prices stabilize))).)&amp;quot; As shown in Pereira and Schabes (1992), an essentially unsupervised learning algorithm such as the Inside-Outside re-estimation process (Baker, 1979; Lari and Young, 1990) can be modified to take advantage of these bracketing constraints.</Paragraph>
    <Paragraph position="3"> For our sample selection experiment, we chose to work under the more stringent condition of partially supervised training data, as described above, because our ultimate goal is to minimize the amount of annotation done by humans in terms of both the number of sentences and the number of brackets within the sentences. Thus, the quality of our induced grammars should not be compared to those extracted from a fully annotated training corpus. The learning algorithm we use is a variant of the Inside-Outside algorithm that induces grammars expressed in the Probabilistic Lexicalized Tree Insertion Grammar representation (Schabes and Waters, 1993; Hwa, 1998). This formalism's Context-free equivalence and its lexicalized representation make the training process efficient and computationally plausible.</Paragraph>
  </Section>
  <Section position="5" start_page="46" end_page="48" type="metho">
    <SectionTitle>
4 Selective Sampling Evaluation Functions
</SectionTitle>
    <Paragraph position="0"/>
      <Paragraph position="0"> In this paper, we propose two uncertainty-based evaluation functions for estimating the training utilities of the candidate sentences.</Paragraph>
      <Paragraph position="1"> The first is a simple heuristic that uses the length of a sentence to estimate uncertainties. The second function computes uncertainty in terms of the entropy of the parse trees that the hypothesis-grammar generated for the sentence.</Paragraph>
    <Section position="2" start_page="46" end_page="46" type="sub_section">
      <SectionTitle>
4.1 Sentence Length
</SectionTitle>
      <Paragraph position="0"> Let us first consider a simple evaluation function that estimates the training utility of a candidate without consulting the current hypothesis-grammar, G. The function ften(s,G) coarsely approximates the uncertainty of a candidate sentence s with its length: flen(S, G) = length(s).</Paragraph>
      <Paragraph position="1"> The intuition behind this function is based on the general observation that longer sentences tend to have complex structures and introduce more opportunities for ambiguous parses. Since the scoring only depends on sentence lengths, this naive evaluation function orders the training pool deterministically regardless of either the current state of the grammar or the annotation of previous training sentences. This approach has one major advantage: it is easy to compute and takes negligible processing time.</Paragraph>
    </Section>
    <Section position="3" start_page="46" end_page="48" type="sub_section">
      <SectionTitle>
4.2 Tree Entropy
</SectionTitle>
      <Paragraph position="0"> Sentence length is not a very reliable indicator of uncertainty. To measure the uncertainty of a sentence more accurately, the evaluation function must base its estimation on the outcome of testing the sentence on the hypothesis-grammar. When a stochastic grammar parses a sentence, it generates a set of possible trees and associates a likelihood value with each. Typically, the most likely tree is taken to be the best parse for the sentence. null We propose an evaluation function that considers the probabilities of all parses. The  set of probabilities of the possible parse trees for a sentence defines a distribution that indicates the grammar's uncertainty about the structure of the sentence. For example, a uniform distribution signifies that the grammar is at its highest uncertainty because all the parses are equally likely; whereas a distribution resembling an impulse function suggests that the grammar is very certain because it finds one parse much more likely than all others. To quantitatively characterize a distribution, we compute its entropy.</Paragraph>
      <Paragraph position="1"> Entropy measures the uncertainty of assigning a value to a random variable over a distribution. Informally speaking, it is the expected number of bits needed to encode the assignment. A higher entropy value signifies a higher degree of uncertainty. At the highest uncertainty, the random variable is assigned one of n values over a uniform distribution, and the outcome would require log2 (n) bits to encode.</Paragraph>
      <Paragraph position="2"> More formally, let V be a discrete random variable that can take any possible outcome in set V. Let p(v) be the density function</Paragraph>
      <Paragraph position="4"> Further details about the properties of entropy can be found in textbooks on information theory (Cover and Thomas, 1991).</Paragraph>
      <Paragraph position="5"> Determining the parse tree for a sentence from a set of possible parses can be viewed as assigning a value to a random variable. Thus, a direct application of the entropy definition to the probability distribution of the parses for sentence s in grammar G computes its tree entropy, TE(s, G), the expected number of bits needed to encode the distribution of possible parses for s. Note that we cannot compare sentences of different lengths by their entropy.</Paragraph>
      <Paragraph position="6"> For two sentences of unequal lengths, both with uniform distributions, the entropy of the longer one is higher. To normalize for sentence length, we define an evaluation function that computes the similarity between the actual probability distribution and the uniform distribution for a sentence of that length. For a sentence s of length l, there can be at most 0(2 l) equally likely parse trees and its maxireal entropy is 0(l) bits (Cover and Thomas, 1991). Therefore, we define the evaluation function, fte(s, G) to be the tree entropy divided by the sentence length.</Paragraph>
      <Paragraph position="8"> We now derive the expression for TE(s, G).</Paragraph>
      <Paragraph position="9"> Suppose that a sentence s can be generated by a grammar G with some non-zero probability, Pr(s \[ G). Let V be the set of possible parses that G generated for s. Then the probability that sentence s is generated by G is the sum of the probabilities of its parses. That is:</Paragraph>
      <Paragraph position="11"> Note that Pr(v \[ G) reflects the probability of one particular parse tree, v, in the grammar out of all possible parse trees for all possible sentences that G accepts. But in order to apply the entropy definition from above, we need to specify a distribution of probabilities for the parses of sentence s such that</Paragraph>
      <Paragraph position="13"> the correct parse tree out of a set of possible parses for s according to grammar G. It is also the density function, p(v), for the distribution (i.e., the probability of assigning v to a random variable V). Using Bayes Rule and noting that Pr(v, s \[ G) = Pr(v \[ G) (because the existence of tree v implies the existence of sentence s), we get:</Paragraph>
      <Paragraph position="15"> Replacing the generic density function term in the entropy definition, we derive the expression for TE(s, G), the tree entropy of s:</Paragraph>
      <Paragraph position="17"> ming technique of computing Inside Probabilities (Lari and Young, 1990), we can efficiently compute the probability of the sentence, Pr(s I G). Similarly, the algorithm can be modified to compute the quantity ~\]v~vPr( v I G)log2(Pr(v I G)) (see Appendix A).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="48" end_page="48" type="metho">
    <SectionTitle>
5 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> To determine the effectiveness of selecting training examples with the two proposed evaluation functions, we compare them against a baseline of random selection (frand(S, G) = rand()). The task is to induce grammars from selected sentences in the Wall Street Journal (WSJ) corpus, and to parse unseen test sentences with the trained gr~.mmars. Because the vocabulary size (and the grammar size by extension) is very large, we have substituted the words with their part-of-speech tags to avoid additional computational complexity in training the grammar. After replacing the words with part-of-speech tags, the vocabulary size of the corpus is reduced to 47 tags. We repeat the study for two different candidate-pool sizes. For the first experiment, we assume that there exists an abundant sup-ply of unlabeled data. Based on empirical observations (as will be shown in Section 6), for the task we are considering, the induction algorithm typically reaches its asymptotic limit after training with 2600 sentences; therefore, it is sufficient to allow for a candidate-pool size of U = 3500 unlabeled WSJ sentences. In the second experiment, we restrict the size of the candidate-pool such that U contains only 900 unlabeled sentences. This experiment studies how the paucity of training data affects the evaluation functions.</Paragraph>
    <Paragraph position="1"> For both experiments, each of the three evaluation functions: frand, ften, and fte, is applied to the sample selection learning algorithm shown in Figure 1, where concept C is the current hypothesis-grammar G, and L, the set of labeled training data; initially consists of 100 sentences. In every iteration, n = 100 new sentences are picked from U to be added to L, and a new C is induced from the updated L. After the hypothesis-grammar is updated, it is tested. The quality of the induced grammax is judged by its ability to generate correct parses for unseen test sentences. We use the consistent bracketing metric (i.e., the percentage of brackets in the proposed parse not crossing brackets of the true parse) to measure parsing accuracy 1. To ensure the staffstical significance of the results, we report the average of ten trials for each experiment 2.</Paragraph>
  </Section>
class="xml-element"></Paper>