<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0408"> <Title>Updating an NLP System to Fit New Domains: an empirical study on the sentence segmentation problem</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Generalized Winnow for Sentence </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Boundary Detection </SectionTitle> <Paragraph position="0"> For the purpose of this paper, we consider the following form of the sentence boundary detection problem: determine, for each period &quot;.&quot;, whether or not it denotes a sentence boundary (most non-boundary cases occur in abbreviations). Although other symbols such as &quot;?&quot; and &quot;!&quot; may also denote sentence boundaries, they occur relatively rarely, and when they do occur they are easy to determine. There are a number of special situations, for example three (or more) periods denoting omission, where we classify only the third period as an end-of-sentence marker. The treatment of these special situations is not important for the purpose of this paper.</Paragraph> <Paragraph position="1"> The above formulation of the sentence segmentation problem can be treated as a binary classification problem. One method that has been successfully applied to a number of linguistic problems is the Winnow algorithm (Littlestone, 1988; Khardon et al., 1999). However, a drawback of this method is that the algorithm does not necessarily converge for data that are not linearly separable. A generalization was recently proposed and applied to the text chunking problem (Zhang et al., 2002), where it was shown that this generalization can indeed improve the performance of Winnow.</Paragraph> <Paragraph position="2"> Applying the generalized Winnow algorithm to the sentence boundary detection problem is straightforward since the method solves a binary classification problem directly. 
In the following, we briefly review this algorithm and the properties useful in our study.</Paragraph> <Paragraph position="3"> Consider the binary classification problem: determine a label y ∈ {−1, 1} associated with an input vector x. A useful method for solving this problem is through linear discriminant functions, which consist of linear combinations of components of the input vector.</Paragraph> <Paragraph position="4"> Specifically, we seek a weight vector w and a threshold θ with the following decision rule: if w · x < θ we predict that the label is y = −1, and if w · x ≥ θ we predict that the label is y = 1. We denote by d the dimension of the weight vector w, which equals the dimension of the input vector x. The weight w and threshold θ can be computed with the generalized Winnow method, which is based on the following optimization problem:</Paragraph> <Paragraph position="6"> The numerical method we use to solve this problem, presented as Algorithm 1, is based on a dual formulation of the above problem. See (Zhang et al., 2002) for a detailed derivation of the algorithm and its relationship with the standard Winnow.</Paragraph> <Paragraph position="7"> In all experiments, we use the same parameter values suggested in (Zhang et al., 2002) for the text chunking problem.</Paragraph> <Paragraph position="9"> The above parameter choices may not be optimal for sentence segmentation. 
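As a concrete illustration of the decision rule and the truncation-based probability estimate reviewed here, consider the following sketch; it is our own illustration, not the authors' implementation, and the function names are ours:

```python
def predict(w, x, theta):
    """Linear decision rule: predict +1 if w . x >= theta, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= theta else -1


def prob_estimate(w, x, theta):
    """Estimate P(y = +1 | x) as (T(w . x - theta) + 1) / 2,
    where T truncates its argument onto the interval [-1, 1]."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - theta
    t = max(-1.0, min(1.0, score))  # truncation T
    return (t + 1.0) / 2.0
```

A score far above the threshold yields a probability estimate of 1.0, a score exactly at the threshold yields 0.5, and intermediate scores interpolate linearly.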
However, since the purpose of this paper is not to demonstrate the best possible sentence segmentation system using this approach, we simply fix these parameters for all experiments.</Paragraph> <Paragraph position="10"> Algorithm 1 (Generalized Winnow) input: training data (x_1, y_1), ..., (x_n, y_n)</Paragraph> <Paragraph position="12"> output: weight vector w and threshold θ</Paragraph> <Paragraph position="14"/> <Paragraph position="16"> It was shown in (Zhang et al., 2002) that if (w, θ) is obtained from Algorithm 1, then the quantity (T(w · x − θ) + 1)/2, where T truncates its argument onto the interval [−1, 1], can be regarded as an estimate for the in-class conditional probability. As we will see, this property will be very useful for our purposes. For each period in the text, we construct a feature vector x as the input to the generalized Winnow algorithm, and use its prediction to determine whether or not the period denotes a sentence boundary. In order to construct x, we consider linguistic features surrounding the period, as listed in Table 1. Since the feature construction routine is written in the Java language, &quot;type of character&quot; features correspond to the Java character types, which can be found in any standard Java manual. We picked these features by looking at features used previously, as well as adding some of our own that we thought might be useful. However, we have not examined which features are actually important to the algorithm (for example, by looking at the size of the weights) and which are not.</Paragraph> <Paragraph position="17"> We use an encoding scheme similar to that of (Zhang et al., 2002). For each data point, the associated features are encoded as a binary vector x. Each component of x corresponds to a possible value v of a feature f in Table 1. 
The value of the component is one if the corresponding feature f takes the value v, and zero if f takes another value.</Paragraph> <Paragraph position="18"> Table 1. Features used for sentence boundary detection:
token before the period
token after the period
character to the right
type of character to the right
character to the left
type of character to the left
character to the right of blank after word
type of character to the right of blank after word
character left of first character of word
type of character left of first character of word
first character of the preceding word
type of first character of the preceding word
length of preceding word
distance to previous period
The features presented here may not be optimal. In particular, unlike (Zhang et al., 2002), we do not use higher order features (for example, combinations of the above features). However, this list of features already gives good performance, comparing favorably with previous approaches (see (Reynar and Ratnaparkhi, 1997; Mikheev, 2000) and references therein).</Paragraph> <Paragraph position="19"> The standard evaluation data is the Wall Street Journal (WSJ) treebank. Based on our processing scheme, the training set contains about seventy-four thousand periods, and the test set contains about thirteen thousand periods. If we train on the training set and test on the test set, the accuracy is 99.7%. Another data set which has been annotated is the Brown corpus. If we train on the WSJ training set and test on the Brown corpus, the accuracy is 99.2%. 
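To illustrate this encoding scheme, here is a small sketch of our own; the vocabulary of (feature, value) pairs would be collected from the training data, and the names used are hypothetical:

```python
def encode(features, vocab):
    """Encode a data point as a binary vector.

    `features` maps feature names (e.g. "token before the period") to their
    observed values; `vocab` is the ordered list of (feature, value) pairs
    seen in training, one vector component per pair.
    """
    return [1 if features.get(f) == v else 0 for (f, v) in vocab]
```

For example, with vocab = [("token before the period", "Mr"), ("token before the period", "said"), ("character to the right", " ")], the data point {"token before the period": "Mr", "character to the right": " "} encodes as [1, 0, 1].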
The error rate is three times larger.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Design and System Update </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Methods </SectionTitle> <Paragraph position="0"> In our study of system behavior under domain changes, we have also used manually constructed rules to filter out some of the periods. The specific set of rules we have used is: • If a period terminates a non-capitalized word, and is followed by a blank and a capitalized word, then we predict that it is a sentence boundary.</Paragraph> <Paragraph position="1"> • If a period is both preceded and followed by alphanumerical characters, then we predict that it is not a sentence boundary.</Paragraph> <Paragraph position="2"> The above rules achieve error rates of less than 0.1% on both the WSJ and Brown datasets, which is sufficient for our purpose. Note that we did not try to make the above rules as accurate as possible. For example, the first rule will misclassify situations such as &quot;A vs. B&quot;. Eliminating such mistakes is not essential for the purpose of this study.</Paragraph> <Paragraph position="3"> All of our experiments are performed and reported on the remaining periods that are not filtered out by the above manual rules. In this study, the filtering scheme serves two purposes. The first purpose is to magnify the errors. Roughly speaking, the rules classify more than half of the periods, and these periods are also relatively easy to classify with a statistical classifier. Therefore the error rate on the remaining periods is more than doubled. Since the sentence boundary detection problem has a relatively small error rate, this magnification effect is useful for comparing different algorithms. The second purpose is to reduce our manual labeling effort. 
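The two manual filtering rules can be sketched as follows; this is a rough illustration with simplified string handling and our own function name, not the authors' code:

```python
def rule_filter(before, after):
    """Apply the two hand-written rules to a period.

    `before` is the token ending with the period; `after` is the text that
    follows it. Returns True (sentence boundary), False (not a boundary),
    or None (left for the statistical classifier).
    """
    word = before.rstrip(".")
    # Rule 1: a non-capitalized word ends with the period, followed by
    # a blank and a capitalized word -> sentence boundary.
    if word and word[0].islower() and after[:1] == " " and after[1:2].isupper():
        return True
    # Rule 2: the period is both preceded and followed by alphanumerical
    # characters -> not a sentence boundary.
    if before[-2:-1].isalnum() and after[:1].isalnum():
        return False
    return None
```

Consistent with the caveat above, this sketch also misclassifies "A vs. B": rule_filter("vs.", " B") returns True.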
In this study, we used a number of datasets that are not annotated. Therefore, for experimentation purposes, we had to label each period manually.</Paragraph> <Paragraph position="4"> After filtering, the WSJ training set contains about twenty-seven thousand data points, and the test set contains about five thousand data points. The Brown corpus contains about seventeen thousand data points. In addition, we also manually labeled the following data: • Reuters: This is a standard dataset for text categorization, available from ... A classifier trained on WSJ does not perform nearly as well on some of the other data sets. However, it is useful to examine the source of these extra errors. We observed that most of the errors are clearly caused by the fact that other domains contain examples that are not represented in the WSJ training set. There are two sources for these previously unseen examples: 1. change of writing style; 2. new linguistic expressions. For example, quote marks are represented as two single quote (or back quote) characters in WSJ, but typically as one double quote character elsewhere. In some data sets such as Reuters, phrases such as &quot;U.S. Economy&quot; or &quot;U.S. Dollar&quot; frequently have the word after the country name capitalized (they also sometimes appear in lower case in the same data). The above can be considered a change of writing style. In some other cases, new expressions may occur. For example, in the MedLine data, new expressions such as &quot;4 degrees C.&quot; are used to indicate temperature, and expressions such as &quot;Bioch. Biophys. Res. Commun. 251, 744-747&quot; are used for citations. In addition, new acronyms and even formulas containing tokens ending with periods occur in such domains.</Paragraph> <Paragraph position="5"> It is clear that the majority of errors are caused by data that are not represented in the training set. 
This fact suggests that when we apply a statistical system to a new domain, we need to check whether the domain contains a significant number of previously unseen examples that may cause performance deterioration. This can be achieved by measuring the similarity of the new test domain to the training domain. One way is to compute statistics on the training domain and compare them to statistics computed on the new test domain; another way is to calculate a properly defined distance between the test data and the training data. However, it is not immediately obvious which data statistics are important for determining classification performance. Similarly, it is not clear which distance metric would be good to use. To avoid such difficulties, in this paper we assume that the classifier itself can provide a confidence measure for each prediction, and we use this information to estimate the classifier's performance.</Paragraph> <Paragraph position="6"> As we have mentioned earlier, the generalized Winnow method approximately minimizes the classification error, and allows us to use</Paragraph> <Paragraph position="8"> (T(w · x − θ) + 1)/2 as an estimate of the conditional probability P(y = 1 | x). From simple algebra, we obtain an estimate of the classification error as E_x (1 − |T(w · x − θ)|)/2. Since T(w · x − θ) yields only an approximation of the conditional probability, this estimate may not be entirely accurate. However, one would expect it to give a reasonably indicative measure of the classification performance. In Table 2, we compare the true classification accuracy on the annotated test data to the estimated accuracy obtained with this method. It clearly shows that this estimate indeed correlates very well with the true classification performance. 
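One way to read this unsupervised estimate in code (our sketch, assuming the truncation-based probability estimate above; each prediction is counted as correct with estimated probability max(p, 1 − p)):

```python
def estimated_accuracy(scores):
    """Estimate classification accuracy from scores s = w . x - theta alone.

    For each example, p = (T(s) + 1) / 2 estimates P(y = +1 | x), and the
    prediction is estimated to be correct with probability max(p, 1 - p).
    No true labels are needed.
    """
    total = 0.0
    for s in scores:
        t = max(-1.0, min(1.0, s))  # truncation T
        p = (t + 1.0) / 2.0
        total += max(p, 1.0 - p)
    return total / len(scores)
```

Scores far from the decision boundary contribute an estimated accuracy near 1, while scores near the boundary contribute values near 0.5, so a domain full of unseen, low-confidence examples drags the estimate down.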
Note that this estimate does not require knowing the true labels of the data.</Paragraph> <Paragraph position="9"> Therefore we are able to detect the potential performance degradation of the classifier on a new domain using this metric, without any ground truth information.</Paragraph> <Paragraph position="10"> Table 2. True versus estimated accuracy on the WSJ, Brown, Reuters, and MedLine datasets. As pointed out before, a major source of error for a new application domain comes from data that are not represented in the training set. If we can identify those data, then a natural way to enhance the underlying classifier's performance is to include them in the training data and retrain. However, a human is required to obtain labels for the new data, and our goal is to reduce the human labeling effort as much as possible. We therefore examine the potential of using the classifier to determine which part of the data it has difficulty with, and then asking a human to label that part. If the underlying classifier can provide confidence information, then it is natural to assume that its confidence on unseen data will likely be low. Therefore, for labeling purposes, one can choose data from the new domain for which the confidence is low. This idea is very similar to certain methods used in active learning. In particular, a confidence-based sample selection scheme was proposed in (Lewis and Catlett, 1994). One potential problem with this approach is that by choosing data with lower confidence levels, noisy data that are difficult to classify tend to be chosen; another problem is that it tends to choose similar data multiple times. However, in this paper we do not investigate methods that address these issues.</Paragraph> <Paragraph position="11"> For baseline comparison, we consider the classifier obtained from the old training data (see Table 3), as well as classifiers trained on random samples from the new domain (see Table 4). 
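The confidence-based selection just described (choose the data for which the classifier is least confident) could look like the following; the function name and interface are our own illustration:

```python
def select_least_confident(examples, scores, n):
    """Pick the n examples whose scores are closest to the decision boundary.

    With s = w . x - theta, the confidence is |s|: a small |s| means the
    prediction is uncertain, so the example is a good labeling candidate.
    """
    ranked = sorted(zip(examples, scores), key=lambda pair: abs(pair[1]))
    return [ex for ex, _ in ranked[:n]]
```

As noted above, ranking purely by |s| tends to favor noisy examples and near-duplicates; those issues are not addressed here.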
In this study, we explore the following three ideas to improve the performance: • Data balancing: Merge labeled data from the new domain with the existing training data from the old domain; we also balance their relative proportions so that the effect of one domain does not dominate the other.</Paragraph> <Paragraph position="12"> • Feature augmentation: Use the old classifier (first level classifier) to create new features for the data, and then train another classifier (second level classifier) with the augmented features (on newly labeled data from the new domain).</Paragraph> <Paragraph position="13"> • Confidence-based data selection: Instead of random sampling, select the data from the new domain with the lowest confidence according to the old classifier.</Paragraph> <Paragraph position="14"> One may combine the above ideas. In particular, we compare the following methods in this study: • Random: Randomly selected data from the new domain. • Balanced: Use the WSJ training set plus randomly selected data from the new domain; however, we supersample the randomly selected data so that its effective sample size is k times that of the WSJ training set, where k is a balancing factor.</Paragraph> <Paragraph position="15"> • Augmented (Random): Use the default classifier output to form additional features. 
Then train a second level classifier on randomly selected data from the new domain, with these additional features.</Paragraph> <Paragraph position="16"> In our experiments, four binary features are added; they correspond to tests on s (where s = w · x − θ is the output of the first level classifier).</Paragraph> <Paragraph position="17"> • Augmented-Balanced: As indicated, use the additional features as well as the original WSJ training set for the second level classifier.</Paragraph> <Paragraph position="18"> • Confidence-Balanced: Instead of random sampling from the new domain, choose the least confident data (which are more likely to provide new information), and then balance with the WSJ training set. • Augmented-Confidence-Balanced: This method is similar to Augmented-Balanced; however, we label the least confident data instead of a random sample.</Paragraph> </Section> </Section> </Paper>