<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1094"> <Title>Composition of Conditional Random Fields for Transfer Learning</Title>
<Section position="4" start_page="748" end_page="749" type="metho"> <SectionTitle> 3 Dynamic CRFs </SectionTitle>
<Paragraph position="0"> Dynamic conditional random fields (Sutton et al., 2004) extend linear-chain CRFs in the same way that dynamic Bayes nets (Dean & Kanazawa, 1989) extend HMMs.</Paragraph>
<Paragraph position="1"> Rather than having a single monolithic state variable, DCRFs factorize the state at each time step by an undirected model.</Paragraph>
<Paragraph position="2"> Formally, DCRFs are the class of conditionally-trained undirected models that repeat structure and parameters over a sequence. If we denote by $\Phi_c(\mathbf{y}_{c,t}, \mathbf{x}_t)$ the repetition of clique $c$ at time step $t$, then a DCRF defines the probability of a label sequence $\mathbf{y}$ given the input $\mathbf{x}$ as:</Paragraph>
<Paragraph position="3"> $$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t} \prod_{c} \Phi_c(\mathbf{y}_{c,t}, \mathbf{x}_t),$$ </Paragraph>
<Paragraph position="4"> where, as before, the clique templates are parameterized in terms of input features as</Paragraph>
<Paragraph position="5"> $$\Phi_c(\mathbf{y}_{c,t}, \mathbf{x}_t) = \exp\left\{ \sum_{k} \lambda_{ck} f_{ck}(\mathbf{y}_{c,t}, \mathbf{x}_t) \right\}.$$ </Paragraph>
<Paragraph position="6"> Exact inference in DCRFs can be performed by forward-backward in the cross-product state space, if the cross-product space is not so large as to be infeasible.</Paragraph>
<Paragraph position="7"> Otherwise, approximate methods must be used; in our experience, loopy belief propagation is often effective in grid-shaped DCRFs. Even if inference is performed monolithically, however, a factorized state representation is still useful because it requires far fewer parameters than a fully-parameterized linear chain in the cross-product state space.</Paragraph>
<Paragraph position="8"> Sutton et al. (2004) introduced the factorial CRF (FCRF), in which the factorized state structure is a grid (Figure 1). FCRFs were originally applied to jointly performing interdependent language processing tasks, in particular part-of-speech tagging and noun-phrase chunking. The previous work on FCRFs used joint training, which requires a single training set that is jointly labeled for all tasks in the cascade. For many tasks such data is not readily available; for example, labeling syntactic parse trees for every new Web extraction task would be prohibitively expensive. In this paper, we train the subtasks separately, which allows us the freedom to use large, standard data sets for well-studied subtasks such as named-entity recognition.</Paragraph> </Section>
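To make the repeated clique templates concrete, the following is a minimal sketch (not the authors' implementation) of scoring one joint labeling under a two-level factorial CRF. The feature names, the weight dictionary, and the `fcrf_score` helper are illustrative assumptions, and the partition function $Z(\mathbf{x})$ needed for actual probabilities is omitted.

```python
# Minimal sketch of how a factorial CRF's clique templates repeat over time.
# Weights, feature names, and helpers here are illustrative; Z(x) is omitted.

def log_potential(weights, features):
    """log Phi_c(y_c,t, x_t) = sum_k lambda_k * f_k(y_c,t, x_t)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def fcrf_score(x, y_upper, y_lower, weights):
    """Unnormalized log score of a joint labeling under a two-level FCRF.

    The same clique templates (and weights) are applied at every time step:
      - a transition clique within each chain (t-1, t),
      - a cotemporal clique linking the two chains at time t,
    and every clique may also condition on the observed input x[t].
    """
    score = 0.0
    for t in range(len(x)):
        if t > 0:
            score += log_potential(weights, {
                f"upper_trans:{y_upper[t-1]}->{y_upper[t]}": 1.0,
                f"lower_trans:{y_lower[t-1]}->{y_lower[t]}": 1.0,
            })
        score += log_potential(weights, {
            f"cotemporal:{y_lower[t]},{y_upper[t]}": 1.0,
            f"obs:{x[t].lower()}|upper={y_upper[t]}": 1.0,
            f"obs:{x[t].lower()}|lower={y_lower[t]}": 1.0,
        })
    return score  # p(y|x) is proportional to exp(score)

# Toy usage: compare two labelings of the same input under hypothetical weights.
w = {"cotemporal:PERSONNAME,SPEAKER": 2.0, "obs:host:|lower=O": 1.5}
x = ["Host:", "Michael"]
print(fcrf_score(x, ["O", "SPEAKER"], ["O", "PERSONNAME"], w))   # 3.5
print(fcrf_score(x, ["O", "O"], ["O", "PERSONNAME"], w))         # 1.5
```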
<Section position="5" start_page="749" end_page="749" type="metho"> <SectionTitle> 4 Alternatives for Learning Transfer </SectionTitle>
<Paragraph position="0"> In this section, we enumerate several classes of methods for learning transfer, based on the amount and type of interaction they allow between the tasks. The principal differences between methods are whether the individual tasks are performed separately in a cascade or jointly; whether a single prediction from the lower task is used, or several; and what kind of confidence information is shared between the subtasks.</Paragraph>
<Paragraph position="1"> The main types of transfer learning methods are: 1. Cascaded training and testing. This is the traditional approach in NLP, in which the single best prediction from the old task is used in the new task at training and test time. In this paper, we show that allowing richer interactions between the subtasks can benefit performance.</Paragraph>
<Paragraph position="2"> 2. Joint training and testing. In this family of approaches, a single model is trained to perform all the subtasks at once. For example, in Caruana's work on multitask learning (Caruana, 1997), a neural network is trained to jointly perform multiple classification tasks, with hidden nodes that form a shared representation among the tasks. Jointly trained methods allow potentially the richest interaction between tasks, but can be expensive both in the computation time required for training and in the human effort required to label the joint training data.</Paragraph>
<Paragraph position="3"> Exact inference in a jointly-trained model, such as forward-backward in an FCRF, implicitly considers all possible subtask predictions, with confidence given by the model's probability of each prediction. However, for computational efficiency, we can use inference methods such as particle filtering and sparse message-passing (Pal et al., 2005), which communicate only a limited number of predictions between sections of the model.</Paragraph>
<Paragraph position="4"> (Caption, Figure 1: All of the pairwise cliques also have links to the observed input, although these edges are omitted from the diagram for clarity.)</Paragraph>
<Paragraph position="5"> 3. Joint testing with cascaded training. Although a joint model over all the subtasks can have better performance, it is often much more expensive to train. One approach for reducing training time is cascaded training, which provides both computational efficiency and the ability to reuse large, standard training sets for the subtasks. At test time, though, the separately-trained models are combined into a single model, so that joint decoding can propagate information between the tasks.</Paragraph>
<Paragraph position="6"> Even with cascaded training, it is possible to preserve some uncertainty in the subtask's predictions. Instead of using only a single subtask prediction for training the main task, the subtask can pass upwards a lattice of likely predictions, each of which is weighted by the model's confidence. This has the advantage of making the training procedure more similar to the joint testing procedure, in which all possible subtask predictions are considered.</Paragraph>
<Paragraph position="7"> In the next two sections, we describe and evaluate joint testing with cascaded training for transfer learning in linear-chain CRFs. At training time, only the best subtask prediction is used, without any confidence information. Even though this is perhaps the simplest joint-testing/cascaded-training method, we show that it still leads to a significant gain in accuracy.</Paragraph> </Section>
<Section position="6" start_page="749" end_page="750" type="metho"> <SectionTitle> 5 Composition of CRFs </SectionTitle>
<Paragraph position="0"> In this section we briefly describe how we combine individually-trained linear-chain CRFs using composition. For a series of $N$ cascaded tasks, we train individual CRFs separately on each task, using the prediction of the previous CRF as a feature. We index the CRFs by $i$, so that the state of CRF $i$ at time $t$ is denoted $s^i_t$.</Paragraph>
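As a rough illustration of this cascaded feature construction (described in more detail below), the following sketch builds an observation feature map for CRF $i$ that includes the lower CRF's predicted label and its conjunctions with the input features. The feature names and helper functions are illustrative assumptions, not the paper's code; the resulting dictionaries could be fed to any linear-chain CRF trainer.

```python
# Minimal sketch: observation features for cascaded transfer, where CRF i sees
# the previous CRF's predicted label as a feature, plus conjunctions of that
# prediction with the input features.  Names and inputs are illustrative.

def token_features(tokens, t):
    """Basic input features for position t (word identity, shape cues)."""
    word = tokens[t]
    feats = {
        f"word={word.lower()}": 1.0,
        "is_capitalized": float(word[0].isupper()),
        "ends_with_colon": float(word.endswith(":")),
    }
    if t > 0:
        feats[f"prev_word={tokens[t - 1].lower()}"] = 1.0
    return feats

def cascaded_features(tokens, prev_predictions, t):
    """Features for CRF i at position t, given CRF i-1's predicted labels."""
    base = token_features(tokens, t)
    feats = dict(base)
    prev_label = prev_predictions[t]          # e.g. "PERSONNAME" or "O"
    feats[f"prev_crf={prev_label}"] = 1.0
    # Conjunctions of the lower CRF's state with every input feature,
    # e.g. "prev_crf=PERSONNAME AND prev_word=host:".
    for name, value in base.items():
        feats[f"prev_crf={prev_label} AND {name}"] = value
    return feats

# Example: the token after "Host:" is a person name, but not the speaker.
tokens = ["Host:", "Michael", "Erdmann"]
ner_predictions = ["O", "PERSONNAME", "PERSONNAME"]   # output of the lower CRF
print(cascaded_features(tokens, ner_predictions, 1))
```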
<Paragraph position="1"> Thus, the feature functions for CRF $i$ are of the form $f^i_k(s^i_{t-1}, s^i_t, s^{i-1}_t, \mathbf{x}, t)$; that is, they depend not only on the observed input $\mathbf{x}$ and the transition $(s^i_{t-1} \rightarrow s^i_t)$, but also on the state $s^{i-1}_t$ of the previous transducer.</Paragraph>
<Paragraph position="3"> (Table 1, fragment: input features for the seminars data, including lexicon features for first names, last names, honorifics, etc., and features such as "$w_t$ appears to be part of a time followed by a dash," "$w_t$ appears to be part of a time preceded by a dash," and "$w_t$ appears to be part of a date." In the table, $w_t$ is the word at position $t$, $T_t$ is the POS tag at position $t$, $w$ ranges over all words in the training data, and $T$ ranges over all Penn Treebank part-of-speech tags. The "appears to be" features are based on hand-designed regular expressions that can span several tokens.)</Paragraph>
<Paragraph position="4"> We also add all conjunctions of the input features and the previous transducer's state, for example, a feature that is 1 if the current state is SPEAKERNAME, the previous transducer predicted PERSONNAME, and the previous word is Host:.</Paragraph>
<Paragraph position="5"> To perform joint decoding at test time, we form the composition of the individual CRFs, viewed as finite-state transducers. That is, we define a new linear-chain CRF whose state space is the cross product of the states of the individual CRFs, and whose transition costs are the sum of the transition costs of the individual CRFs.</Paragraph>
<Paragraph position="6"> Formally, let $S^1, S^2, \ldots, S^N$ be the state sets and $\Lambda^1, \Lambda^2, \ldots, \Lambda^N$ the weights of the individual CRFs. Then the state set of the combined CRF is $S = S^1 \times S^2 \times \cdots \times S^N$. We will denote weight $k$ in an individual CRF $i$ by $\lambda^i_k$ and a single feature by $f^i_k(s^i_{t-1}, s^i_t, s^{i-1}_t, \mathbf{x}, t)$. Then for $s \in S$, the combined model is given by:</Paragraph>
<Paragraph position="7"> $$p(s \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left\{ \sum_{t} \sum_{i} \sum_{k} \lambda^i_k f^i_k(s^i_{t-1}, s^i_t, s^{i-1}_t, \mathbf{x}, t) \right\}.$$ </Paragraph>
<Paragraph position="8"> The graphical model for the combined model is the factorial CRF in Figure 1.</Paragraph> </Section>
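A minimal sketch of joint decoding under this composition, assuming two trained CRFs are available only through scoring functions that sum $\lambda^i_k f^i_k(\cdot)$ over their features. The function names (`joint_viterbi`, `score1`, `score2`) and the toy scorers are illustrative, not the paper's implementation.

```python
# Viterbi decoding over the cross-product state space S1 x S2, where the
# transition score is the sum of the individual CRFs' scores (see the
# combined model above).  score1/score2 stand in for the trained models.
import itertools
import math

def joint_viterbi(x, states1, states2, score1, score2):
    """score1(s1_prev, s1, x, t): lower CRF's transition score.
       score2(s2_prev, s2, s1, x, t): upper CRF's score, which also
       conditions on the lower CRF's state at time t."""
    T = len(x)
    joint_states = list(itertools.product(states1, states2))
    delta, backptr = {}, []
    for t in range(T):
        new_delta, new_bp = {}, {}
        for (s1, s2) in joint_states:
            best_prev, best_score = None, -math.inf
            prev_candidates = joint_states if t > 0 else [(None, None)]
            for (p1, p2) in prev_candidates:
                prev_score = delta[(p1, p2)] if t > 0 else 0.0
                sc = prev_score + score1(p1, s1, x, t) + score2(p2, s2, s1, x, t)
                if sc > best_score:
                    best_prev, best_score = (p1, p2), sc
            new_delta[(s1, s2)] = best_score
            new_bp[(s1, s2)] = best_prev
        delta = new_delta
        backptr.append(new_bp)
    best_last = max(delta, key=delta.get)
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

# Toy usage with hypothetical scoring functions (real ones would come from
# the trained CRFs' weights and feature functions).
states_ner = ["O", "PERSONNAME"]
states_sem = ["O", "SPEAKER"]
def score_ner(prev, cur, x, t):
    return 1.0 if (cur == "PERSONNAME") == x[t][0].isupper() else 0.0
def score_sem(prev, cur, ner_state, x, t):
    wants_speaker = ner_state == "PERSONNAME" and t > 0 and x[t - 1] == "Host:"
    return 1.0 if (cur == "SPEAKER") == wants_speaker else 0.0
print(joint_viterbi(["Host:", "Michael"], states_ner, states_sem, score_ner, score_sem))
```

The cost of this exact decoding grows with $|S^1| \cdot |S^2|$ squared per time step, which is why the text above notes that the cross-product space must be small enough to be feasible; otherwise approximate inference is needed.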
<Section position="7" start_page="750" end_page="752" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="750" end_page="751" type="sub_section"> <SectionTitle> 6.1 Email Seminar Announcements </SectionTitle>
<Paragraph position="0"> We evaluate joint decoding on a collection of 485 e-mail messages announcing seminars at Carnegie Mellon University, gathered by Freitag (1998). The messages are annotated with the seminar's starting time, ending time, location, and speaker. This data set has been the subject of much previous work using a wide variety of learning methods. Despite all this work, however, the best reported systems have precision and recall on speaker names of only about 70%--too low to use in a practical system. This task is challenging because the messages are written by many different people, who each have different ways of presenting the announcement information. Because the task includes finding locations and person names, the output of a named-entity tagger is a useful feature. It is not a perfectly indicative feature, however, because many other kinds of person names appear in seminar announcements--for example, names of faculty hosts, departmental secretaries, and sponsors of lecture series. For example, the token Host: strongly indicates that what follows is a person name, but that person is not the seminar's speaker.</Paragraph>
<Paragraph position="1"> Even so, named-entity predictions do improve performance on this task. We use the predictions from a CRF named-entity tagger that we trained on the standard CoNLL 2003 English data set. The CoNLL 2003 data set consists of newswire articles from Reuters labeled as people, locations, organizations, or miscellaneous entities. It is much larger than the seminar announcements data set: while the named-entity data contains 203,621 tokens for training, the seminar announcements data set contains only slightly over 60,000 training tokens.</Paragraph>
<Paragraph position="2"> Previous work on the seminars data has used a one-field-per-document evaluation. That is, for each field, the CRF selects a single field value from its Viterbi path, and this extraction is counted as correct if it exactly matches any of the true field mentions in the document. We compute precision and recall following this convention, and report their harmonic mean F1. As in the previous work, we use 10-fold cross validation with a 50/50 training/test split. We use a spherical Gaussian prior on parameters with variance $\sigma^2 = 0.5$.</Paragraph>
<Paragraph position="4"> We evaluate whether joint decoding with cascaded training performs better than cascaded training and decoding. Table 2 compares cascaded and joint decoding for CRFs with other previous results from the literature. The features we use are listed in Table 1. Although previous work has used very different feature sets, we include a no-transfer CRF baseline to assess the impact of transfer from the CoNLL data set. All the CRF runs used exactly the same features.</Paragraph>
<Paragraph position="5"> On the most challenging fields, location and speaker, cascaded transfer is more accurate than no transfer at all, and joint decoding is more accurate than cascaded decoding. In particular, for speaker, we see an error reduction of 8% by using joint decoding over cascaded decoding. The difference in F1 between cascaded and joint decoding is statistically significant for speaker (paired t-test; p = 0.017) but only marginally significant for location (p = 0.067).</Paragraph>
<Paragraph position="6"> Our results are competitive with previous work; for example, on location, the CRF is more accurate than any of the existing systems.</Paragraph>
<Paragraph position="7"> Examining the trained models, we can observe both errors made by the general-purpose named-entity tagger, and how they can be corrected by considering the seminars labels. In newswire text, long runs of capitalized words are rare, often indicating the name of an entity. In email announcements, runs of capitalized words are common in formatted text blocks like "Location: Baker Hall" and "Host: Michael Erdmann". In this type of situation, the named-entity tagger often mistakes Host: for the name of an entity, especially because the word preceding Host: is also capitalized. On one of the cross-validated testing sets, of 80 occurrences of the word Host:, the named-entity tagger labels 52 as some kind of entity. When joint decoding is used, however, only 20 occurrences are labeled as entities. Recall that the joint model uses exactly the same weights as the cascaded model; the only difference is that the joint model takes into account information about the seminar labels when choosing named-entity labels. This is an example of how domain-specific information from the main task can improve performance on a more standard, general-purpose subtask.</Paragraph>
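For concreteness, here is a minimal sketch of the one-field-per-document scoring described earlier in this section: each field gets at most one extracted value per document, and it counts as correct if it exactly matches any true mention. The data structures and the treatment of missed fields when a wrong value is proposed are assumptions for illustration, not the authors' evaluation script.

```python
# Illustrative one-field-per-document precision/recall/F1 computation.
def one_field_per_document_f1(predictions, gold):
    """predictions: {doc_id: {field: extracted value or None}}
       gold:        {doc_id: {field: set of true mention strings}}"""
    tp = fp = fn = 0
    for doc_id, gold_fields in gold.items():
        pred_fields = predictions.get(doc_id, {})
        for field, true_mentions in gold_fields.items():
            value = pred_fields.get(field)
            if value is None:
                if true_mentions:
                    fn += 1            # field present in the document, missed
            elif value in true_mentions:
                tp += 1                # exact match with some true mention
            else:
                fp += 1                # spurious or wrong extraction
                if true_mentions:
                    fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Tiny usage example with hypothetical documents.
gold = {"d1": {"speaker": {"Michael Erdmann", "M. Erdmann"}, "location": {"Baker Hall"}}}
pred = {"d1": {"speaker": "Michael Erdmann", "location": "Wean Hall"}}
print(one_field_per_document_f1(pred, gold))   # (0.5, 0.5, 0.5)
```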
<Paragraph position="9"> Figure 2 shows the difference in performance between joint and cascaded decoding as a function of training set size. Cascaded decoding with the full training set of 242 emails performs equivalently to joint decoding on only 181 training instances, a 25% reduction in the training set.</Paragraph>
<Paragraph position="10"> In summary, even with a simple cascaded training method on a well-studied data set, joint decoding performs better for transfer than cascaded decoding.</Paragraph> </Section>
<Section position="2" start_page="751" end_page="752" type="sub_section"> <SectionTitle> 6.2 Entity Recognition </SectionTitle>
<Paragraph position="0"> In this section we give results on joint decoding for transfer between two newswire data sets with overlapping but not identical label sets. The Automatic Content Extraction (ACE) data set is another standard entity recognition data set, containing 422 stories from newspaper, newswire, and broadcast news. Unlike the CoNLL entity recognition data set, in which only proper names of entities are annotated, the ACE data includes annotations both for named entities, like United States, and for nominal mentions of entities, like the nation. Thus, although the input text has a similar distribution in the CoNLL NER and ACE data sets, the label distributions are very different.</Paragraph>
<Paragraph position="1"> Current state-of-the-art systems for the ACE task (Florian et al., 2004) use the predictions of other named-entity recognizers as features; that is, they use cascaded transfer. In this experiment, we test whether the transfer between these data sets can be further improved using joint decoding. We train a CRF entity recognizer on the ACE data set, using as a feature the output of a named-entity recognizer trained on the CoNLL 2003 English data set. The CoNLL recognizer is the same CRF as was used in the previous experiment. In these results, we use a subset of 10% of the ACE training data. Table 3 lists the features we use; as before, $w_t$ denotes the word at position $t$, and $w$ ranges over all words in the training data. Table 4 compares the results on some representative entity types (GPE means geopolitical entities, such as countries). Again, cascaded decoding for transfer is better than no transfer at all, and joint decoding is better than cascaded decoding. Interestingly, joint decoding has the most impact on the harder nominal (common-noun) references, showing marked improvement over the cascaded approach.</Paragraph> </Section> </Section>
<Section position="8" start_page="752" end_page="752" type="metho"> <SectionTitle> 7 Related Work </SectionTitle>
<Paragraph position="0"> Researchers have begun to accumulate experimental evidence that joint training and decoding yields better performance than the cascaded approach. As mentioned earlier, the original work on dynamic CRFs (Sutton et al., 2004) demonstrated improvement due to joint training in the domains of part-of-speech tagging and noun-phrase chunking. Also, Carreras and Màrquez (2004) have obtained increased performance in clause finding by training a cascade of perceptrons to minimize a single global error function. Finally, Miller et al.
(2000) have combined entity recognition, parsing, and relation extraction into a single, jointly-trained statistical parsing model that achieves improved performance on all the subtasks.</Paragraph>
<Paragraph position="1"> Part of the contribution of the current work is to suggest that joint decoding can be effective even when joint training is not possible because jointly-labeled data is unavailable. For example, Miller et al. report that they originally attempted to annotate newswire articles for all of parsing, relations, and named entities, but they stopped because the annotation was simply too expensive. Instead, they hand-labeled relations only, assigning parse trees to the training set using a standard statistical parser. This is potentially less flexible than cascaded training, because the model for the main task is trained explicitly to match the noisy subtask predictions, rather than being free to correct them.</Paragraph>
<Paragraph position="2"> In the speech community, it is common to compose separately-trained weighted finite-state transducers (Mohri et al., 2002) for joint decoding. Our method extends this work to conditional models. Ordinarily, higher-level transducers depend only on the output of the previous transducer: a transducer for the lexicon, for example, consumes only phonemes, not the original speech signal. In text, however, such an approach is not sensible, because there is simply not enough information in the named-entity labels, for example, to do extraction if the original words are discarded. In a conditional model, weights in higher-level transducers are free to depend on arbitrary features of the original input without any additional complexity in the finite-state structure.</Paragraph>
<Paragraph position="3"> Finally, stacked sequential learning (Cohen & Carvalho, 2005) is another potential method for combining the results of the subtask transducers. In this general meta-learning method for sequential classification, first a base classifier predicts the label at each time step, and then a higher-level classifier makes the final prediction, including as features a window of predictions from the base classifier. For transfer learning, this would correspond to having an independent base model for each subtask (e.g., independent CRFs for named-entity recognition and seminars), and then having a higher-level CRF that includes as features the predictions from the base models.</Paragraph> </Section> </Paper>