<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1615">
  <Title>Domain Adaptation with Structural Correspondence Learning</Title>
  <Section position="5" start_page="121" end_page="122" type="metho">
    <SectionTitle>
3 Structural Correspondence Learning
</SectionTitle>
    <Paragraph position="0"> Structural correspondence learning involves a source domain and a target domain. Both domains have ample unlabeled data, but only the source domain has labeled training data. We refer to the task for which we have labeled training data as the supervised task. In our experiments, the supervised task is part of speech tagging. We require that the input x in both domains be a vector of binary features from a finite feature space. The first step of SCL is to define a set of pivot features on the unlabeled data from both domains. We then use these pivot features to learn a mapping th from the original feature spaces of both domains to a shared, low-dimensional real-valued feature space. A high inner product in this new space indicates a high degree of correspondence.</Paragraph>
    <Paragraph position="1"> During supervised task training, we use both the transformed and original features from the source domain. During supervised task testing, we use the both the transformed and original features from the target domain. If we learned a good mapping th, then the classifier we learn on the source domain will also be effective on the target domain.</Paragraph>
    <Paragraph position="2"> The SCL algorithm is given in Figure 3, and the remainder of this section describes it in detail.</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
3.1 Pivot Features
</SectionTitle>
      <Paragraph position="0"> Pivot features should occur frequently in the unlabeled data of both domains, since we must estimate their covariance with non-pivot features accurately, but they must also be diverse enough to adequately characterize the nuances of the supervised task. A good example of this tradeoff are determiners in PoS tagging. Determiners are good pivot features, since they occur frequently in any domain of written English, but choosing only determiners will not help us to discriminate between nouns and adjectives. Pivot features correspond to the auxiliary problems of Ando and Zhang (2005a).</Paragraph>
      <Paragraph position="1"> In section 2, we showed example pivot features of type &lt;the token on the right&gt;.</Paragraph>
      <Paragraph position="2"> We also use pivot features of type &lt;the token on the left&gt; and &lt;the token in the middle&gt;. In practice there are many thousands of pivot features, corresponding to instantiations of these three types for frequent words in both domains. We choose m pivot features, which we index with lscript.</Paragraph>
    </Section>
    <Section position="2" start_page="121" end_page="122" type="sub_section">
      <SectionTitle>
3.2 Pivot Predictors
</SectionTitle>
      <Paragraph position="0"> From each pivot feature we create a binary classification problem of the form &amp;quot;Does pivot feature lscript occur in this instance?&amp;quot;. One such example is &amp;quot;Is &lt;the token on the right&gt; required?&amp;quot; These binary classification problems can be trained from the unlabeled data, since they merely represent properties of the input. If we represent our features as a binary vector x, we can solve these problems using m linear predictors.</Paragraph>
      <Paragraph position="1"> flscript(x) = sgn( ^wlscript x), lscript = 1 ... m Note that these predictors operate on the original feature space. This step is shown in line 2 of Figure 3. Here L(p,y) is a real-valued loss function for binary classification. We follow Ando and Zhang (2005a) and use the modified Huber loss.</Paragraph>
      <Paragraph position="2"> Since each instance contains features which are totally predictive of the pivot feature (the feature itself), we never use these features when making the binary prediction. That is, we do not use any feature derived from the right word when solving a right token pivot predictor.</Paragraph>
      <Paragraph position="3"> The pivot predictors are the key element in SCL.</Paragraph>
      <Paragraph position="4"> The weight vectors ^wlscript encode the covariance of the non-pivot features with the pivot features. If the weight given to the z'th feature by the lscript'th  pivot predictor is positive, then feature z is positively correlated with pivot feature lscript. Since pivot features occur frequently in both domains, we expect non-pivot features from both domains to be correlated with them. If two non-pivot features are correlated in the same way with many of the same pivot features, then they have a high degree of correspondence. Finally, observe that ^wlscript is a linear projection of the original feature space onto R.</Paragraph>
    </Section>
    <Section position="3" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
3.3 Singular Value Decomposition
</SectionTitle>
      <Paragraph position="0"> Since each pivot predictor is a projection onto R, we could create m new real-valued features, one for each pivot. For both computational and statistical reasons, though, we follow Ando and Zhang (2005a) and compute a low-dimensional linear approximation to the pivot predictor space. Let W be the matrix whose columns are the pivot predictor weight vectors. Now let W = UDV T be the singular value decomposition of W, so that th = UT[1:h,:] is the matrix whose rows are the top left singular vectors of W.</Paragraph>
      <Paragraph position="1"> The rows of th are the principal pivot predictors, which capture the variance of the pivot predictor space as best as possible in h dimensions. Furthermore, th is a projection from the original feature space onto Rh. That is, thx is the desired mapping to the (low dimensional) shared feature representation. This is step 3 of Figure 3.</Paragraph>
    </Section>
    <Section position="4" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
3.4 Supervised Training and Inference
</SectionTitle>
      <Paragraph position="0"> To perform inference and learning for the supervised task, we simply augment the original feature vector with features obtained by applying the mapping th. We then use a standard discriminative learner on the augmented feature vector. For training instance t, the augmented feature vector will contain all the original features xt plus the new shared features thxt. If we have designed the pivots well, then th should encode correspondences among features from different domains which are important for the supervised task, and the classifier we train using these new features on the source domain will perform well on the target domain.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="122" end_page="122" type="metho">
    <SectionTitle>
4 Model Choices
</SectionTitle>
    <Paragraph position="0"> Structural correspondence learning uses the techniques of alternating structural optimization (ASO) to learn the correlations among pivot and non-pivot features. Ando and Zhang (2005a) describe several free paramters and extensions to ASO, and we briefly address our choices for these here. We set h, the dimensionality of our low-rank representation to be 25. As in Ando and Zhang (2005a), we observed that setting h between 20 and 100 did not change results significantly, and a lower dimensionality translated to faster run-time.</Paragraph>
    <Paragraph position="1"> We also implemented both of the extensions described in Ando and Zhang (2005a). The first is to only use positive entries in the pivot predictor weight vectors to compute the SVD. This yields a sparse representation which saves both time and space, and it also performs better. The second is to compute block SVDs of the matrix W, where one block corresponds to one feature type. We used the same 58 feature types as Ratnaparkhi (1996).</Paragraph>
    <Paragraph position="2"> This gave us a total of 1450 projection features for both semisupervised ASO and SCL.</Paragraph>
    <Paragraph position="3"> We found it necessary to make a change to the ASO algorithm as described in Ando and Zhang (2005a). We rescale the projection features to allow them to receive more weight from a regularized discriminative learner. Without any rescaling, we were not able to reproduce the original ASO results. The rescaling parameter is a single number, and we choose it using heldout data from our source domain. In all our experiments, we rescale our projection features to have average L1 norm on the training set five times that of the binary-valued features.</Paragraph>
    <Paragraph position="4"> Finally, we also make one more change to make optimization faster. We select only half of the ASO features for use in the final model. This is done by running a few iterations of stochastic gradient descent on the PoS tagging problem, then choosing the features with the largest weightvariance across the different labels. This cut in half training time and marginally improved performance in all our experiments.</Paragraph>
  </Section>
  <Section position="7" start_page="122" end_page="123" type="metho">
    <SectionTitle>
5 Data Sets and Supervised Tagger
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
5.1 Source Domain: WSJ
</SectionTitle>
      <Paragraph position="0"> We used sections 02-21 of the Penn Treebank (Marcus et al., 1993) for training. This resulted in 39,832 training sentences. For the unlabeled data, we used 100,000 sentences from a 1988 subset of the WSJ.</Paragraph>
    </Section>
    <Section position="2" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
5.2 Target Domain: Biomedical Text
</SectionTitle>
      <Paragraph position="0"> For unlabeled data we used 200,000 sentences that were chosen by searching MEDLINE for abstracts pertaining to cancer, in particular genomic varia- null similarly to each other for classification, but differently from words on the right (positive valued). The projection distinguishes nouns from adjectives and determiners in both domains. tions and mutations. For labeled training and testing purposes we use 1061 sentences that have been annotated by humans as part of the Penn BioIE project (PennBioIE, 2005). We use the same 561sentence test set in all our experiments. The part-of-speech tag set for this data is a superset of the Penn Treebank's including the two new tags HYPH (for hyphens) and AFX (for common post-modifiers of biomedical entities such as genes).</Paragraph>
      <Paragraph position="1"> These tags were introduced due to the importance of hyphenated entities in biomedical text, and are used for 1.8% of the words in the test set. Any tagger trained only on WSJ text will automatically predict wrong tags for those words.</Paragraph>
    </Section>
    <Section position="3" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
5.3 Supervised Tagger
</SectionTitle>
      <Paragraph position="0"> Since SCL is really a method for inducing a set of cross-domain features, we are free to choose any feature-based classifier to use them. For our experiments we use a version of the discriminative online large-margin learning algorithm MIRA (Crammer et al., 2006). MIRA learns and outputs a linear classification score, s(x,y; w) = w f(x,y), where the feature representation f can contain arbitrary features of the input, including the correspondence features described earlier. In particular, MIRA aims to learn weights so that the score of correct output, yt, for input xt is separated from the highest scoring incorrect outputs2, with a margin proportional to their Hamming losses. MIRA has been used successfully for both sequence analysis (McDonald et al., 2005a) and dependency parsing (McDonald et al., 2005b).</Paragraph>
      <Paragraph position="1"> As with any structured predictor, we need to factor the output space to make inference tractable.</Paragraph>
      <Paragraph position="2"> We use a first-order Markov factorization, allowing for an efficient Viterbi inference procedure.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="123" end_page="123" type="metho">
    <SectionTitle>
6 Visualizing θ
</SectionTitle>
    <Paragraph position="0"> In section 2 we claimed that good representations should encode correspondences between words like &amp;quot;signal&amp;quot; from MEDLINE and &amp;quot;investment&amp;quot; from the WSJ. Recall that the rows of th are projections from the original feature space onto the real line. Here we examine word features under these projections. Figure 4 shows a row from the matrix th. Applying this projection to a word gives a real value on the horizontal dashed line axis. The words below the horizontal axis occur only in the WSJ. The words above the axis occur only in MEDLINE. The verticle line in the middle represents the value zero. Ticks to the left or right indicate relative positive or negative values for a word under this projection. This projection discriminates between nouns (negative) and adjectives (positive). A tagger which gives high positive weight to the features induced by applying this projection will be able to discriminate among the associated classes of biomedical words, even when it has never observed the words explicitly in the WSJ source training set.</Paragraph>
  </Section>
  <Section position="9" start_page="123" end_page="125" type="metho">
    <SectionTitle>
7 Empirical Results
</SectionTitle>
    <Paragraph position="0"> All the results we present in this section use the MIRA tagger from Section 5.3. The ASO and structural correspondence results also use projection features learned using ASO and SCL. Section 7.1 presents results comparing structural correspondence learning with the supervised baseline and ASO in the case where we have no labeled data in the target domain. Section 7.2 gives results for the case where we have some limited data in the target domain. In this case, we use classifiers as features as described in Florian et al. (2004).</Paragraph>
    <Paragraph position="1"> Finally, we show in Section 7.3 that our SCL PoS  tagger improves the performance of a dependency parser on the target domain.</Paragraph>
    <Section position="1" start_page="124" end_page="125" type="sub_section">
      <SectionTitle>
7.1 No Target Labeled Training Data
</SectionTitle>
      <Paragraph position="0"> For the results in this section, we trained a structural correspondence learner with 100,000 sentences of unlabeled data from the WSJ and 100,000 sentences of unlabeled biomedical data.</Paragraph>
      <Paragraph position="1"> We use as pivot features words that occur more than 50 times in both domains. The supervised baseline does not use unlabeled data. The ASO baseline is an implementation of Ando and Zhang (2005b). It uses 200,000 sentences of unlabeled MEDLINE data but no unlabeled WSJ data. For ASO we used as auxiliary problems words that occur more than 500 times in the MEDLINE unlabeled data.</Paragraph>
      <Paragraph position="2"> Figure 5(a) plots the accuracies of the three models with varying amounts of WSJ training data. With one hundred sentences of training data, structural correspondence learning gives a 19.1% relative reduction in error over the supervised baseline, and it consistently outperforms both baseline models. Figure 5(b) gives results for 40,000 sentences, and Figure 5(c) shows corresponding significance tests, with p &lt; 0.05 being significant. We use a McNemar paired test for labeling disagreements (Gillick and Cox, 1989).</Paragraph>
      <Paragraph position="3"> Even when we use all the WSJ training data available, the SCL model significantly improves accuracy over both the supervised and ASO baselines. The second column of Figure 5(b) gives unknown word accuracies on the biomedical data.</Paragraph>
      <Paragraph position="4">  Of thirteen thousand test instances, approximately three thousand were unknown. For unknown words, SCL gives a relative reduction in error of 19.5% over Ratnaparkhi (1996), even with 40,000 sentences of source domain training data.</Paragraph>
    </Section>
    <Section position="2" start_page="125" end_page="125" type="sub_section">
      <SectionTitle>
7.2 Some Target Labeled Training Data
</SectionTitle>
      <Paragraph position="0"> In this section we give results for small amounts of target domain training data. In this case, we make use of the out-of-domain data by using features of the source domain tagger's predictions in training and testing the target domain tagger (Florian et al., 2004). Though other methods for incorporating small amounts of training data in the target domain were available, such as those proposed by Chelba and Acero (2004) and by Daum'e III and Marcu (2006), we chose this method for its simplicity and consistently good performance. We use as features the current predicted tag and all tag bigrams in a 5-token window around the current token.</Paragraph>
      <Paragraph position="1"> Figure 6(a) plots tagging accuracy for varying amounts of MEDLINE training data. The two horizontal lines are the fixed accuracies of the SCL WSJ-trained taggers using one thousand and forty thousand sentences of training data. The five learning curves are for taggers trained with varying amounts of target domain training data. They use features on the outputs of taggers from section 7.1. The legend indicates the kinds of features used in the target domain (in addition to the standard features). For example, &amp;quot;40k-SCL&amp;quot; means that the tagger uses features on the outputs of an SCL source tagger trained on forty thousand sentences of WSJ data. &amp;quot;nosource&amp;quot; indicates a target tagger that did not use any tagger trained on the source domain. With 1000 source domain sentences and 50 target domain sentences, using SCL tagger features gives a 20.4% relative reduction in error over using supervised tagger features and a 39.9% relative reduction in error over using no source features.</Paragraph>
      <Paragraph position="2"> Figure 6(b) is a table of accuracies for 500 target domain training sentences, and Figure 6(c) gives corresponding significance scores. With 1000 source domain sentences and 500 target domain sentences, using supervised tagger features gives no improvement over using no source features. Using SCL features still does, however.</Paragraph>
    </Section>
    <Section position="3" start_page="125" end_page="125" type="sub_section">
      <SectionTitle>
7.3 Improving Parser Performance
</SectionTitle>
      <Paragraph position="0"> We emphasize the importance of PoS tagging in a pipelined NLP system by incorporating our SCL  ent part of speech taggers tagger into a WSJ-trained dependency parser and and evaluate it on MEDLINE data. We use the parser described by McDonald et al. (2005b). That parser assumes that a sentence has been PoS-tagged before parsing. We train the parser and PoS tagger on the same size of WSJ data.</Paragraph>
      <Paragraph position="1">  the sentences using the PoS tags output by our source domain supervised tagger, the SCL tagger from subsection 7.1, and the gold PoS tags. All of the differences in this figure are significant according to McNemar's test. The SCL tags consistently improve parsing performance over the tags output by the supervised tagger. This is a rather indirect method of improving parsing performance with SCL. In the future, we plan on directly incorporating SCL features into a discriminative parser to improve its adaptation properties.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>