<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0408"> <Title>Updating an NLP System to Fit New Domains: an empirical study on the sentence segmentation problem</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> We carried out experiments on the Brown, Reuters, and MedLine datasets. We randomly partitioned each dataset into a training set and a test set. All methods were trained using only information from the training set, and their performance was evaluated on the test set. Each test set contains a fixed number of randomly selected data points. This sample size was chosen so that the accuracy estimated from the empirical samples is reasonably close to the true accuracy: for a binary classifier evaluated on n test points, the standard deviation of the empirical accuracy around the true accuracy p is sqrt(p(1-p)/n), which is at most 1/(2 sqrt(n)) and shrinks further as p approaches one. Since the experiments show that the accuracy of all algorithms improves to a high level on all three datasets, the test set size is sufficiently large to distinguish the remaining differences with reasonable confidence. Table 3 lists the test set performance of classifiers trained on the WSJ training set (denoted WSJ), on the training set from the same domain as the test set (Brown, Reuters, or MedLine, respectively; denoted Self), and on their combination. The Self results indicate upper limits on what can be achieved using the corresponding in-domain training information. It is also interesting to see that the combination does not necessarily improve performance. We compare the different updating schemes by the number of new labels they require from the new domain.
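The sample-size reasoning above can be made concrete with the binomial standard error; a minimal sketch (the test-set size n below is illustrative, not the paper's actual value):

```python
import math

def accuracy_std_dev(p: float, n: int) -> float:
    """Standard deviation of the empirical accuracy of a binary
    classifier evaluated on n i.i.d. test points when the true
    accuracy is p (binomial standard error)."""
    return math.sqrt(p * (1.0 - p) / n)

# The worst case p = 0.5 gives the largest standard deviation,
# bounded by 1 / (2 * sqrt(n)); near-perfect accuracy tightens it.
n = 10_000  # illustrative test-set size, not taken from the paper
worst = accuracy_std_dev(0.5, n)
high = accuracy_std_dev(0.98, n)
assert worst <= 1.0 / (2.0 * math.sqrt(n)) + 1e-12
print(f"std dev at p=0.5:  {worst:.4f}")   # prints 0.0050
print(f"std dev at p=0.98: {high:.4f}")
```

As the sketch shows, once observed accuracies are high, differences of a fraction of a percent exceed the sampling noise, which is the argument made in the text.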
For this purpose, we use a range of increasing amounts of labeled instances from the new domain, corresponding to the &quot;new data&quot; column in the tables. For all experiments, if a specific result requires random sampling, then five different random runs were performed, and the corresponding result is reported in the format &quot;mean ± std. dev.&quot; over the five runs.</Paragraph> <Paragraph position="1"> Table 4 contains the performance of classifiers trained on randomly selected data from the new domain alone. It is interesting to observe that even with a relatively small number of training examples, the corresponding classifiers can outperform those obtained from the default WSJ training set, which contains a significantly larger amount of data. Clearly this indicates that in some NLP applications, using data with the right characteristics can be more important than using more data. It also provides strong evidence that one should update a classifier if the underlying domain differs from the training domain. Table 5 (columns: new data, Brown, Reuters, MedLine) reports the performance when the old WSJ training data are combined with the newly labeled data. With the same amount of newly labeled data, the improvement over the random method is significant. This shows that even though the domain has changed, training data from the old domain are still very useful. Observe that not only is the average performance improved, but the variance is also reduced. Note that in this table we have fixed the weighting parameter at a single default value. The performance with different values of this parameter on the MedLine dataset is reported in Table 6; it shows that different choices make relatively small differences in accuracy. At this point, it is interesting to check whether the estimated accuracy (using the method described for Table 2) reflects the change in performance improvement. The result is given in Table 7.
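The old-plus-new combination can be sketched with generic per-example weights; the parameter `lam` and the data sizes below are hypothetical and stand in for the paper's actual weighting scheme, which is not reproduced here:

```python
def pooled_weights(n_old: int, n_new: int, lam: float) -> list[float]:
    """Weights for pooling old-domain and new-domain training data:
    each old-domain example is down-weighted by lam, each new-domain
    example keeps weight 1.0 (a generic reweighting sketch, not the
    paper's exact scheme)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam should lie in [0, 1]")
    return [lam] * n_old + [1.0] * n_new

# With lam = 0.5, 1000 old examples contribute the same total weight
# as 500 new ones, so a small new-domain sample can steer training
# while the old data still regularize it.
w = pooled_weights(n_old=1000, n_new=500, lam=0.5)
print(sum(w[:1000]), sum(w[1000:]))  # prints 500.0 500.0
```

Any learner that accepts per-example weights can consume such a list, which is one simple way to realize the "old data plus new data" schemes compared in the tables.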
Clearly the method we propose still leads to reasonable estimates.</Paragraph> <Paragraph position="2"> We also report results with augmented features (columns: new data, Brown, Reuters, MedLine), either with the random sampling scheme or with the balancing scheme. It can be seen that with feature augmentation, the random sampling and the balancing schemes perform similarly. Although the feature augmentation method does not improve the overall performance compared with the balancing scheme alone, one advantage is that we no longer have to rely on the old training data. In principle, one may even use a two-level classification scheme: use the old classifier if it gives a high-confidence prediction, and otherwise use the new classifier trained on the new domain. However, we have not explored such combinations.</Paragraph> <Paragraph position="3"> Finally, we report results with confidence-based data selection instead of random sampling (columns: new data, Brown, Reuters, MedLine). This method helps to some extent, but not as much as we originally expected. However, we have only used the simplest version of this method, which is susceptible to the two problems mentioned earlier: it tends (a) to select data that are inherently hard to classify, and (b) to select redundant data. Both problems can be avoided with a more elaborate implementation, but we have not explored this. Another possible reason that confidence-based sample selection does not yield a significant performance improvement is that, for our examples, the performance is already quite good with even a small number of new samples.</Paragraph> </Section></Paper>
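The unexplored two-level scheme described in the text can be sketched as follows; the confidence threshold and the toy stand-in classifiers are hypothetical, for illustration only:

```python
from typing import Callable, Sequence

# A classifier maps a feature vector to a (label, confidence) pair.
Classifier = Callable[[Sequence[float]], tuple[int, float]]

def two_level_predict(x: Sequence[float],
                      old_clf: Classifier,
                      new_clf: Classifier,
                      threshold: float = 0.9) -> int:
    """Two-level scheme: trust the old-domain classifier when its
    confidence is high, otherwise fall back to the classifier
    trained on the new domain."""
    label, conf = old_clf(x)
    if conf >= threshold:
        return label
    label, _ = new_clf(x)
    return label

# Toy stand-in classifiers (hypothetical, not from the paper).
old = lambda x: (1, 0.95) if x[0] > 0.5 else (0, 0.60)
new = lambda x: (0, 0.99)

print(two_level_predict([0.9], old, new))  # old is confident -> 1
print(two_level_predict([0.1], old, new))  # low confidence -> defer to new -> 0
```

The threshold trades off how often the cheap, already-trained old classifier is used against how often the adapted one is consulted.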