File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/p03-1064_metho.xml
Size: 18,089 bytes
Last Modified: 2025-10-06 14:08:19
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1064"> <Title>A SNoW based Supertagger with Application to NP Chunking</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 SNoW </SectionTitle> <Paragraph position="0"> Sparse Network of Winnows (SNoW) (Roth, 1998) is a learning architecture that is specially tailored for learning in the presence of a very large number of features, where the decision for a single sample depends on only a small number of them. Furthermore, SNoW can also be used as a general purpose multi-class classifier.</Paragraph> <Paragraph position="1"> It is noted in (Muñoz et al., 1999) that one of the important properties of the sparse architecture of SNoW is that the complexity of processing an example depends only on the number of features active in it, $n_a$, and is independent of the total number of features, $n_t$, observed over the lifetime of the system. This is important in domains in which the total number of features is very large, but only a small number of them is active in each example.</Paragraph> <Paragraph position="2"> As far as supertagging is concerned, the word context forms a very large feature space. However, for each word in a given sentence, only a small part of the features in the space is related to the decision on its supertag. Specifically, the supertag of a word is determined by the appearance of certain words, POS tags, or supertags in its context. Therefore SNoW is suitable for the supertagging task.</Paragraph> <Paragraph position="3"> Supertagging can be viewed in terms of a sequential model, in which the selection of the supertag for a word is influenced by the decisions made on the previous few words. (Punyakanok and Roth, 2000) proposed three methods of using classifiers in sequential inference: HMM, PMM and CSCL. Among these three models, PMM is the most suitable for our task. The basic idea of PMM is as follows.</Paragraph> <Paragraph position="4"> Given an observation sequence $O = o_1 o_2 \cdots o_n$, we find the most likely state sequence $S = s_1 s_2 \cdots s_n$ given $O$ by maximizing $$P(S \mid O) = P_1(s_1 \mid o_1) \prod_{t=2}^{n} P(s_t \mid s_{t-1}, o_t).$$ In this model, the output of SNoW is used to estimate $P(s \mid s', o)$ and $P_1(s \mid o)$, where $s$ is the current state, $s'$ is the previous state, and $o$ is the current observation. $P(s \mid s', o)$ is separated into subfunctions $P_{s'}(s \mid o)$ according to the previous state $s'$. In practice, $P_{s'}(s \mid o)$ is estimated over a wider window of the observed sequence, instead of $o$ only. The problem is then how to map the SNoW results into probabilities. In (Punyakanok and Roth, 2000), the sigmoid $1/(1 + e^{\theta - \mathbf{w} \cdot \mathbf{x}})$ is defined as the confidence, where $\theta$ is the threshold for SNoW and $\mathbf{w} \cdot \mathbf{x}$ is the dot product of the weight vector and the example vector. The confidences are normalized to sum to 1 and used as the distribution mass $P_{s'}(s \mid o)$.</Paragraph>
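<Paragraph position="5"> To make the cost argument concrete, the following minimal sketch shows a Winnow-style target node over sparse features; both prediction and update touch only the active features, so the per-example cost is $O(n_a)$ rather than $O(n_t)$. The function names and parameter values are illustrative assumptions, not the SNoW implementation's API or defaults.

```python
def activation(weights, active):
    """Sum the weights of the ACTIVE features only: the cost is linear
    in n_a and independent of the total feature count n_t."""
    return sum(weights.get(f, 1.0) for f in active)

def winnow_update(weights, active, label, theta=1.0,
                  promotion=1.5, demotion=0.5):
    """Mistake-driven multiplicative (Winnow) update of one target
    node: promote active-feature weights on a missed positive,
    demote them on a false positive."""
    predicted = activation(weights, active) >= theta
    if predicted != label:
        factor = promotion if label else demotion
        for f in active:
            weights[f] = weights.get(f, 1.0) * factor
```
</Paragraph>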
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Modeling Supertagging </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 A Novel Sequential Model with SNoW </SectionTitle> <Paragraph position="0"> First we have to decide how to treat POS tags. One approach is to assign POS tags at the same time as we do supertagging. The other approach is to assign POS tags with a traditional POS tagger first, and then use them as input to the supertagger. Handling unknown words is a problem for supertagging due to the huge size of the supertag set, so we use the second approach in this paper. We first run the Brill POS tagger (Brill, 1995) on both the training and the test data, and use the POS tags as part of the input.</Paragraph> <Paragraph position="1"> Let $W = w_1 w_2 \cdots w_n$ be the sentence, $T = t_1 t_2 \cdots t_n$ the corresponding POS tags, and $S = s_1 s_2 \cdots s_n$ the supertags to be predicted. We decompose the classifiers according to the POS tag of the current word, instead of the supertag of the previous word, for the following reasons:</Paragraph> <Paragraph position="2"> - To avoid the sparse-data problem: there are 479 supertags in the set of hand-coded supertags, and almost 3000 supertags in the set of supertags extracted from the Penn Treebank.</Paragraph> <Paragraph position="3"> - Supertags related to the same POS tag are more difficult to distinguish than supertags related to different POS tags. Thus, defining a classifier on the POS tag of the current word, rather than on the POS tag of the previous word, forces the learning algorithm to focus on the difficult cases.</Paragraph> <Paragraph position="4"> - Decomposing the probability estimation decreases the complexity of the learning algorithm and allows the use of different parameters for different POS tags.</Paragraph> <Paragraph position="5"> Each estimate is further decomposed into subfunctions according to the previous supertag $s'$. Following the estimation of the distribution function in (Punyakanok and Roth, 2000), we define the confidence with a sigmoid $$\mathrm{conf}(s) = \frac{1}{1 + e^{(\theta - \mathbf{w} \cdot \mathbf{x})/\gamma}}, \quad (3)$$ where $\theta$ is the threshold of the corresponding SNoW classifier and $\gamma$ is set to 1. The distribution mass is then defined by normalizing these confidences to sum to 1.</Paragraph>
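<Paragraph position="6"> As a sketch of how these decomposed estimates could be computed (the dictionary layout keyed on the current POS tag and the previous supertag is our assumption, not the paper's data structure):

```python
import math

def distribution_mass(pos_tag, prev_supertag, active, classifiers,
                      theta, gamma=1.0):
    """Select the SNoW network indexed by (current POS tag, previous
    supertag), map each candidate supertag's activation through the
    sigmoid of eq. (3), and normalize the confidences to sum to 1."""
    conf = {}
    for s, weights in classifiers[(pos_tag, prev_supertag)].items():
        act = sum(weights.get(f, 1.0) for f in active)
        conf[s] = 1.0 / (1.0 + math.exp((theta - act) / gamma))
    z = sum(conf.values())
    return {s: c / z for s, c in conf.items()}
```
</Paragraph>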
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Label Bias Problem </SectionTitle> <Paragraph position="0"> In (Lafferty et al., 2001), it is shown that PMM and other non-generative finite-state models based on next-state classifiers share a weakness which they call the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. They proposed Conditional Random Fields (CRFs) as a solution to this problem.</Paragraph> <Paragraph position="1"> (Collins, 2002) proposed a new algorithm for parameter estimation as an alternative to CRFs. The new algorithm is similar to the maximum-entropy model except that it skips the local normalization step. Intuitively, it is the local normalization that makes the distribution mass of the transitions leaving a given state incomparable with all other transitions.</Paragraph> <Paragraph position="2"> It is noted in (Muñoz et al., 1999) that SNoW's output provides, in addition to the prediction, a robust confidence level in the prediction, which enables its use in an inference algorithm that combines predictors to produce a coherent inference. In that paper, SNoW's output is used to estimate the probability of open and close tags. In general, the probability of a tag can be estimated by normalizing the confidences, $$P_{s'}(s \mid o) = \frac{\mathrm{conf}_{s'}(s \mid o)}{\sum_{u} \mathrm{conf}_{s'}(u \mid o)},$$ as one of the anonymous reviewers has suggested. However, this makes probabilities comparable only within the transitions sharing the same history $s'$.</Paragraph> <Paragraph position="3"> An alternative to this approach is to use SNoW's output directly in the prediction combination, which makes transitions with different histories comparable, since SNoW's output provides a robust confidence level in the prediction. Furthermore, in order to make sure that the confidences are not too sharp, we use the confidence defined in (3).</Paragraph> <Paragraph position="4"> In addition, we use two supertaggers: one scans from left to right and the other from right to left. We then combine the results via pairwise voting, as in (van Halteren et al., 1998; Chen et al., 1999), to obtain the final supertag. This voting approach also helps to cope with the label bias problem.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Contextual Model </SectionTitle> <Paragraph position="0"> A basic feature is called active for word $w_i$ if and only if the corresponding word/POS-tag/supertag appears at a specified place around $w_i$. For our SNoW classifiers we use unigrams and bigrams of basic features as our feature set. A feature defined as a bigram of two basic features is active if and only if the two basic features are both active. The value of a feature of $w_i$ is set to 1 if this feature is active for $w_i$, and to 0 otherwise.</Paragraph>
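<Paragraph position="1"> For illustration, a sketch of the contextual feature extraction; the window size and template shapes are our assumptions, since the paper does not spell out the exact templates:

```python
def context_features(i, words, pos_tags, supertags, window=2):
    """Unigram (basic) features: word/POS/supertag identities at fixed
    offsets around position i. Supertags are available only for
    already-tagged positions (to the left, when scanning left to
    right)."""
    feats = []
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(words):
            feats.append(f"w[{d}]={words[j]}")
            feats.append(f"p[{d}]={pos_tags[j]}")
            if d < 0:
                feats.append(f"s[{d}]={supertags[j]}")
    # Bigram features: conjunctions of two basic features, active iff
    # both are active; adjacent pairs stand in for the actual templates.
    feats += [f"{a}&{b}" for a, b in zip(feats, feats[1:])]
    return feats
```
</Paragraph>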
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Related Work </SectionTitle> <Paragraph position="0"> (Chen, 2001) implemented an MEMM model for supertagging which is analogous to the POS tagging model of (Ratnaparkhi, 1996). The feature sets used in the MEMM model were similar to ours. In addition, prefix and suffix features were used to handle rare words. Several MEMM supertaggers were implemented based on distinct feature sets.</Paragraph> <Paragraph position="1"> In (Muñoz et al., 1999), SNoW was used for text chunking. The IOB tagging model in that paper is similar to our model for supertagging, but there are some differences. They did not decompose the SNoW classifier with respect to POS tags, and they used a two-level deterministic (beam width = 1) search, in which the second-level IOB classifier takes the IOB output of the first classifier as input features.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experimental Evaluation and Analysis </SectionTitle> <Paragraph position="0"> In our experiments, we use the default settings of the SNoW promotion parameter, demotion parameter and threshold value given by the SNoW system. We train our model on the training data for 2 rounds, counting only the features that appear at least 5 times. We skip the normalization step in testing, and we use beam search with a width of 5.</Paragraph> <Paragraph position="1"> In our first experiment, we use the same dataset as that of (Chen et al., 1999). We use WSJ sections 00 through 24, except section 20, as training data, and section 20 as test data. Both training and test data are first tagged by Brill's POS tagger (Brill, 1995). We use the same pairwise voting algorithm as in (Chen et al., 1999): we run supertagging on the training data and use the supertagging result to generate the mapping table used in pairwise voting.</Paragraph> <Paragraph position="2"> The SNoW supertagger scanning from left to right achieves an accuracy of 92.02%, and the one scanning from right to left achieves an accuracy of 91.43%. By combining the results of these two supertaggers with pairwise voting, we achieve an accuracy of 92.41%, an error reduction of 2.1% compared to 92.25%, the best supertagging result to date (Chen, 2001). Table 1 shows the comparison with previous work.</Paragraph> <Paragraph position="3"> Table 1 (columns: model, acc): Training data is WSJ sections 00 through 24 except section 20 of the PTB. Test data is WSJ section 20. Size of tag set is 479. acc = percentage of accuracy. The number for Srinivas (97) is based on footnote 1 of (Chen et al., 1999). The number for Chen (01) width=5 is the result of a beam search on Model 8 with a width of 5.</Paragraph> <Paragraph position="4"> Our algorithm, which is coded in Java, takes about 10 minutes to supertag the test data with a P3 1.13GHz processor. In contrast, the accuracy of 92.25% in (Chen, 2001) was achieved by a Viterbi search program that took about 5 days to supertag the test data. The counterpart of our algorithm in (Chen, 2001) is the beam search on Model 8 with a width of 5, the same beam width as in our algorithm. Compared with this program, our algorithm achieves an error reduction of 6.1%. (Chen et al., 1999) achieved an accuracy of 92.19% by combining 5 distinct supertaggers, whereas our result is achieved by combining the outputs of two homogeneous supertaggers, which differ only in scan direction.</Paragraph>
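<Paragraph position="5"> A sketch of the pairwise voting combination; this is our simplified reading of (van Halteren et al., 1998), with the mapping table reduced to gold-tag counts per output pair rather than their exact weighting scheme:

```python
from collections import defaultdict

def build_mapping_table(l2r_tags, r2l_tags, gold_tags):
    """Count, for every (left-to-right, right-to-left) output pair
    observed on the training data, how often each gold supertag
    occurs: the mapping table used in pairwise voting."""
    table = defaultdict(lambda: defaultdict(int))
    for a, b, g in zip(l2r_tags, r2l_tags, gold_tags):
        table[(a, b)][g] += 1
    return table

def vote(table, a, b):
    """Output the gold tag most often seen with the pair (a, b);
    fall back to the left-to-right tag for unseen pairs."""
    counts = table.get((a, b))
    return max(counts, key=counts.get) if counts else a
```
</Paragraph>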
<Paragraph position="6"> Our next experiment uses the set of supertags abstracted from the PTB with Fei Xia's LexTract (Xia, 2001). Xia extracted an LTAG-style grammar from the PTB and repeated Srinivas' experiment (Srinivas, 1997) on her supertag set. There are 2920 elementary trees in Xia's grammar $G_2$, so the supertags are more specialized and there is much more ambiguity in supertagging. We have experimented with our model on $G_2$ and her dataset. We train our left-to-right model on WSJ sections 02 through 21 of the PTB, and test on sections 22 and 23. We achieve an average error reduction of 13.0%. The accuracy is rather low because systems using $G_2$ have to cope with much more ambiguity due to the large size of the supertag set. The results are shown in Table 2.</Paragraph> <Paragraph position="7"> Table 2 (columns: model, acc): Training data is WSJ sections 02 through 21 of the PTB. Test data is WSJ sections 22 and 23. Size of supertag set is 2920. acc = percentage of accuracy.</Paragraph> <Paragraph position="8"> We test both normalized and unnormalized models with both the hand-coded supertag set and the auto-extracted supertag set, using the left-to-right SNoW model. The results in Table 3 show that skipping the local normalization improves performance in all the systems. The effect of skipping normalization is more significant on the auto-extracted tags. We think this is because sparse data is more vulnerable to the label bias problem.</Paragraph> <Paragraph position="9"> Table 3 (columns: tag set, size, norm?, acc on 20/22/23): size = size of the tag set. norm? = normalized or not. acc = percentage of accuracy on sections 20, 22 and 23. auto = auto-extracted tag set. hand = hand-coded tag set.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Application to NP Chunking </SectionTitle> <Paragraph position="0"> Now we come back to the NP chunking problem. The standard dataset for NP chunking consists of WSJ sections 15-18 as training data and section 20 as test data. In our approach, we substitute supertags for the POS tags in the dataset. The new data look as follows.</Paragraph> <Paragraph position="1"> For B_Pnxs O</Paragraph> <Paragraph position="2"> The first field is the word, the second is the supertag of the word, and the last is the IOB tag.</Paragraph> <Paragraph position="3"> We first use fast TBL (Ngai and Florian, 2001), a Transformation Based Learning algorithm, to repeat Ramshaw and Marcus' experiment, and then apply the same program to our new dataset. Since sections 15-18 and section 20 are in the standard dataset for NP chunking, we need to avoid using these sections as training data for our supertagger. We therefore train another supertagger on the 776K words in WSJ sections 02-14 and 21-24, tuned with the 44K words in WSJ section 19. We use this supertagger to supertag sections 15-18 and section 20. We then train an NP chunker on sections 15-18 with fast TBL, and test it on section 20.</Paragraph> <Paragraph position="4"> There is a small problem with the supertag set we have been using, as far as NP chunking is concerned: two words with different POS tags may be tagged with the same supertag. For example, both a determiner (DT) and a number (CD) can be tagged with B_Dnx, which is harmful in the case of NP chunking. As a solution, we use augmented supertags that have the POS tag of the lexical item specified. An augmented supertag can also be regarded as the concatenation of a supertag and a POS tag.</Paragraph>
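<Paragraph position="5"> For concreteness, a sketch of rewriting the chunking data with augmented supertags; the slash encoding is our illustrative choice, as the paper only says that the POS tag of the lexical item is specified on the supertag:

```python
def augment(rows):
    """Replace each supertag field with supertag/POS, so that, e.g.,
    B_Dnx on a determiner (B_Dnx/DT) and B_Dnx on a number
    (B_Dnx/CD) become distinct tags for the chunker."""
    return [(word, f"{supertag}/{pos}", iob)
            for word, pos, supertag, iob in rows]

# Example: augment([("For", "IN", "B_Pnxs", "O")])
# -> [("For", "B_Pnxs/IN", "O")]
```
</Paragraph>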
<Paragraph position="6"> The results are shown in Table 4. The system using augmented supertags achieves an F-score of 92.95%, an error reduction of 11.7% below the baseline of using Brill POS tags. Although the two systems are trained with the same TBL algorithm, we implicitly employ more linguistic knowledge as the learning bias when we train the learning machine with supertags: supertags encode more syntactic information than POS tags do. For example, in the sentence Three leading drug companies ..., the POS tag of leading is VBG, a present participle. Based on the local context of leading alone, Three could be the subject of leading. However, the supertag of leading is B_An, which represents a modifier of a noun. With this extra information, the chunker can easily resolve the ambiguity. We find many instances like this in the test data.</Paragraph> <Paragraph position="7"> Table 4: Training data is WSJ sections 15-18 of the PTB. Test data is WSJ section 20. A = accuracy of IOB tagging. P = NP chunk precision. R = NP chunk recall. F = F-score. Brill-POS = fast TBL with Brill's POS tags. Tri-STAG = fast TBL with supertags given by Srinivas' trigram-based supertagger. SNoW-STAG = fast TBL with supertags given by our SNoW supertagger. SNoW-STAG2 = fast TBL with augmented supertags given by our SNoW supertagger. GOLD-POS = fast TBL with gold standard POS tags. GOLD-STAG = fast TBL with gold standard supertags.</Paragraph> <Paragraph position="8"> It is important to note that the accuracy of supertagging itself is much lower than that of POS tagging, yet the use of supertags still improves the overall performance. On the other hand, since the supertagging accuracy is rather low, there is more room left for improvement.</Paragraph> <Paragraph position="9"> If we use gold standard POS tags in the previous experiment, we can only achieve an F-score of 93.34%. However, if we use gold standard supertags, the F-score is as high as 95.17%. This tells us how much room there is for further improvement: improvements in supertagging may give rise to further improvements in chunking.</Paragraph> </Section> </Paper>