<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2902">
  <Title>Porting Statistical Parsers with Data-Defined Kernels</Title>
  <Section position="4" start_page="6" end_page="6" type="metho">
    <SectionTitle>
2 Data-Defined Kernels for Parsing
</SectionTitle>
    <Paragraph position="0"> Previous work has shown how data-defined kernels can be applied to the parsing task (Henderson and Titov, 2005). Given the trained parameters of a probabilistic model of parsing, the method defines a kernel over sentence-tree pairs, which is then used to rerank a list of candidate parses.</Paragraph>
    <Paragraph position="1"> In this paper, we focus on the TOP reranking kernel defined in (Henderson and Titov, 2005), which are closely related to Fisher kernels. The reranking task is defined as selecting a parse tree from the list of candidate trees (y1, . . ., ys) suggested by a probabilistic model P(x, y|^th), where ^th is a vector of model parameters learned during training the probabilistic model. The motivation for the TOP reranking kernel is given in (Henderson and Titov, 2005), but for completeness we note that the its feature extractor is given by:</Paragraph>
    <Paragraph position="3"> where v(x, yk, ^th) = log P(x, yk|^th) [?] logsummationtexttnegationslash=k P(x, yt|^th). The first feature reflects the score given to (x, yk) by the probabilistic model (relative to the other candidates for x), and the remaining features reflect how changing the parameters of the probabilistic model would change this score for (x, yk).</Paragraph>
    <Paragraph position="4"> The parameters ^th used in this feature extractor do not have to be exactly the same as the parameters trained in the probabilistic model. In general, we can first reparameterize the probabilistic model, producing a new model which defines exactly the same probability distribution as the old model, but with a different set of adjustable parameters. For example, we may want to freeze the values of some parameters (thereby removing them from ^th), or split some parameters into multiple cases (thereby duplicating their values in ^th). This flexibility allows the features used in the kernel method to be different from those used in training the probabilistic model. This can be useful for computational reasons, or when the kernel method is not solving exactly the same problem as the probabilistic model was trained for.</Paragraph>
  </Section>
  <Section position="5" start_page="6" end_page="8" type="metho">
    <SectionTitle>
3 Porting with Data-Defined Kernels
</SectionTitle>
    <Paragraph position="0"> In this paper, we consider porting a parser trained on a large amount of annotated data to a different domain where only a small amount of annotated data is available. We validate our method in two different  scenarios, transferring and focusing. Also we verify the hypothesis that addressing differences between the vocabularies of domains is more important than addressing differences between their syntactic structures. null</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.1 Transferring to a Different Domain
</SectionTitle>
      <Paragraph position="0"> In the transferring scenario, we are given just a probabilistic model which has been trained on a large corpus from a source domain. The large corpus is not available during porting, and the small corpus for the target domain is not available during training of the probabilistic model. This is the case of pure parser porting, because it only requires the source domain parser, not the source domain corpus. Besides this theoretical significance, this scenario has the advantage that we only need to train a single probabilistic parser, thereby saving on training time and removing the need for access to the large corpus once this training is done. Then any number of parsers for new domains can be trained, using only the small amount of annotated data available for the new domain.</Paragraph>
      <Paragraph position="1"> Our proposed porting method first constructs a data-defined kernel using the parameters of the trained probabilistic model. A large margin classifier with this kernel is then trained to rerank the top candidate parses produced by the probabilistic model. Only the small target corpus is used during training of this classifier. The resulting parser consists of the original parser plus a very computationally cheap procedure to rerank its best parses. Whereas training of standard large margin methods, like SVMs, isn't feasible on a large corpus, it is quite tractable to train them on a small target corpus.1 Also, the choice of the large margin classifier is motivated by their good generalization properties on small datasets, on which accurate probabilistic models are usually difficult to learn.</Paragraph>
      <Paragraph position="2"> We hypothesize that differences in vocabulary across domains is one of the main difficulties with parser portability. To address this problem, we propose constructing the kernel from a probabilistic model which has been reparameterized to better suit 1In (Shen and Joshi, 2003) it was proposed to use an ensemble of SVMs trained the Wall Street Journal corpus, but we believe that the generalization performance of the resulting classifier is compromised in this approach.</Paragraph>
      <Paragraph position="3"> the target domain vocabulary. As in other lexicalized statistical parsers, the probabilistic model we use treats words which are not frequent enough in the training set as 'unknown' words (Henderson, 2003).</Paragraph>
      <Paragraph position="4"> Thus there are no parameters in this model which are specifically for these words. When we consider a different target domain, a substantial proportion of the words in the target domain are treated as unknown words, which makes the parser only weakly lexicalized for this domain.</Paragraph>
      <Paragraph position="5"> To address this problem, we reparameterize the probability model so as to add specific parameters for the words which have high enough frequency in the target domain training set but are treated as unknown words by the original probabilistic model.</Paragraph>
      <Paragraph position="6"> These new parameters all have the same values as their associated unknown words, so the probability distribution specified by the model does not change.</Paragraph>
      <Paragraph position="7"> However, when a kernel is defined with this reparameterized model, the kernel's feature extractor includes features specific to these words, so the training of a large margin classifier can exploit differences between these words in the target domain. Expanding the vocabulary in this way is also justified for computational reasons; the speed of the probabilistic model we use is greatly effected by vocabulary size, but the large-margin method is not.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
3.2 Focusing on a Subdomain
</SectionTitle>
      <Paragraph position="0"> In the focusing scenario, we are given the large corpus from the source domain. We may also be given a parsing model, but as with other approaches to this problem we simply throw this parsing model away and train a new one on the combination of the source and target domain data. Previous work (Roark and Bacchiani, 2003) has shown that better accuracy can be achieved by finding the optimal re-weighting between these two datasets, but this issue is orthogonal to our method, so we only consider equal weighting.</Paragraph>
      <Paragraph position="1"> After this training phase, we still want to optimize the parser for only the target domain.</Paragraph>
      <Paragraph position="2"> Once we have the trained parsing model, our proposed porting method proceeds the same way in this scenario as in transferring. However, because the original training set already includes the vocabulary from the target domain, the reparameterization approach defined in the preceding section is not necessary so we do not perform it. This reparameter- null ization could be applied here, thereby allowing us to use a statistical parser with a smaller vocabulary, which can be more computationally efficient both during training and testing. However, we would expect better accuracy of the combined system if the same large vocabulary is used both by the probabilistic parser and the kernel method.</Paragraph>
    </Section>
    <Section position="3" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
3.3 Vocabulary versus Structure
</SectionTitle>
      <Paragraph position="0"> It is commonly believed that differences in vocabulary distributions between domains effects the ported parser performance more significantly than the differences in syntactic structure distributions.</Paragraph>
      <Paragraph position="1"> We would like to test this hypothesis in our framework. The probabilistic model (Henderson, 2003) allows us to distinguish between those parameters responsible for the distributions of individual vocabulary items, and those parameters responsible for the distributions of structural decisions, as described in more details in section 4.2. We train two additional models, one which uses a kernel defined in terms of only vocabulary parameters, and one which uses a kernel defined in terms of only structure parameters.</Paragraph>
      <Paragraph position="2"> By comparing the performance of these models and the model with the combined kernel, we can draw conclusion on the relative importance of vocabulary and syntactic structures for parser portability.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="8" end_page="9" type="metho">
    <SectionTitle>
4 An Application to a Neural Network Statistical Parser
</SectionTitle>
      <Paragraph position="0"> Data-defined kernels can be applied to any kind of parameterized probabilistic model, but they are particularly interesting for latent variable models.</Paragraph>
      <Paragraph position="1"> Without latent variables (e.g. for PCFG models), the features of the data-defined kernel (except for the first feature) are a function of the counts used to estimate the model. For a PCFG, each such feature is a function of one rule's counts, where the counts from different candidates are weighted using the probability estimates from the model. With latent variables, the meaning of the variable (not just its value) is learned from the data, and the associated features of the data-defined kernel capture this induced meaning. There has been much recent work on latent variable models (e.g. (Matsuzaki et al., 2005; Koo and Collins, 2005)). We choose to use an earlier neural network based probabilistic model of parsing (Henderson, 2003), whose hidden units can be viewed as approximations to latent variables. This parsing model is also a good candidate for our experiments because it achieves state-of-the-art results on the standard Wall Street Journal (WSJ) parsing problem (Henderson, 2003), and data-defined kernels derived from this parsing model have recently been used with the Voted Perceptron algorithm on the WSJ parsing task, achieving a significant improvement in accuracy over the neural network parser alone (Henderson and Titov, 2005).</Paragraph>
    <Section position="2" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
4.1 The Probabilistic Model of Parsing
</SectionTitle>
      <Paragraph position="0"> The probabilistic model of parsing in (Henderson, 2003) has two levels of parameterization. The first level of parameterization is in terms of a history-based generative probability model. These parameters are estimated using a neural network, the weights of which form the second level of parameterization. This approach allows the probability model to have an infinite number of parameters; the neural network only estimates the bounded number of parameters which are relevant to a given partial parse. We define our kernels in terms of the second level of parameterization (the network weights).</Paragraph>
      <Paragraph position="1"> A history-based model of parsing first defines a one-to-one mapping from parse trees to sequences of parser decisions, d1,..., dm (i.e. derivations). Henderson (2003) uses a form of left-corner parsing strategy, and the decisions include generating the words of the sentence (i.e. it is generative). The probability of a sequence P(d1,..., dm) is then decomposed into the multiplication of the probabilities of each parser decision conditioned on its history of previous decisions PiP(di|d1,..., di[?]1).</Paragraph>
    </Section>
    <Section position="3" start_page="8" end_page="9" type="sub_section">
      <SectionTitle>
4.2 Deriving the Kernel
</SectionTitle>
      <Paragraph position="0"> The complete set of neural network weights isn't used to define the kernel, but instead reparameterization is applied to define a third level of parameterization which only includes the network's output layer weights. As suggested in (Henderson and Titov, 2005) use of the complete set of weights doesn't lead to any improvement of the resulting reranker and makes the reranker training more computationally expensive.</Paragraph>
      <Paragraph position="1"> Furthermore, to assess the contribution of vocabulary and syntactic structure differences (see sec- null tion 3.3), we divide the set of the parameters into vocabulary parameters and structural parameters. We consider the parameters used in the estimation of the probability of the next word given the history representation as vocabulary parameters, and the parameters used in the estimation of structural decision probabilities as structural parameters. We define the kernel with structural features as using only structural parameters, and the kernel with vocabulary features as using only vocabulary parameters.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML