<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0408">
  <Title>Updating an NLP System to Fit New Domains: an empirical study on the sentence segmentation problem</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> An important issue for a statistical, machine-learning-based NLP system is that its performance can depend heavily on the characteristics of the training data used to build the system. Consequently, if we train a system on some data but apply it to other data with different characteristics, then the system's performance can degrade significantly. It is therefore natural to investigate the following related issues: • How to detect a change in the underlying data characteristics, and how to estimate the corresponding degradation in system performance.</Paragraph>
    <Paragraph position="1"> • If performance degradation is detected, how to update a statistical system to improve its performance with as little human effort as possible.</Paragraph>
    <Paragraph position="2"> This paper investigates some methodological and practical aspects of the above issues. Ideally, such a study would include as many different statistical algorithms and as many different linguistic problems as possible, so that a very general conclusion might be drawn. In reality, such an undertaking is not only difficult to carry out but can also hide essential observations and obscure important effects that may depend on many variables. An alternative is to study a relatively simple and well-understood problem in order to gain understanding of the fundamental issues. Causal effects and essential observations can be more easily isolated and identified in simple problems, since fewer variables can affect the outcome of the experiments.</Paragraph>
    <Paragraph position="3"> In this paper, we take the second approach and focus on a specific problem using a specific underlying statistical algorithm. However, we try to use only some fundamental properties of the algorithm so that our methods are readily applicable to other systems with similar properties. Specifically, we use the sentence boundary detection problem to perform experiments since not only is it relatively simple and well-understood, but it also provides the basis for other more advanced linguistic problems.</Paragraph>
    <Paragraph position="4"> Our hope is that some characteristics of this problem are universal to language processing, so that they can be generalized to more complicated linguistic tasks. In this paper we use the generalized Winnow method (Zhang et al., 2002) for all experiments. Applied to text chunking, this method resulted in state-of-the-art performance. It is thus reasonable to conjecture that it is also suitable for other linguistic problems, including sentence segmentation.</Paragraph>
    <Paragraph position="5"> Although issues addressed in this paper are very important for practical applications, there have only been limited studies on this topic in the existing literature.</Paragraph>
    <Paragraph position="6"> In speech processing, various adaptation techniques have been proposed for language modeling. However, the language modeling problem is essentially unsupervised (density estimation) in the sense that it does not require any annotation. Therefore, techniques developed there cannot be applied to our problems. Motivated by adaptive language modeling, transformation-based adaptation techniques have also been proposed for certain supervised learning tasks (Gales and Woodland, 1996). However, they typically considered only very specific statistical models, where the idea is to fit certain transformation parameters. In particular, they considered neither the main issues investigated in this paper nor generally applicable supervised adaptation methodologies such as those we propose. In fact, it would be very difficult to extend their methods to natural language processing problems that use different statistical models. The adaptation idea in (Gales and Woodland, 1996) is also closely related to the idea of combining supervised and unsupervised learning in the same domain (Merialdo, 1994). In machine learning, this is often referred to as semi-supervised learning or learning with unlabeled data. Such methods are not always reliable and can often fail (Zhang and Oles, 2000). Although potentially useful for small shifts in distributional parameters, they cannot recover labels for examples that are not represented, or are inadequately represented, in the old training data. In such cases, it is necessary to use supervised adaptation methods, which we study in this paper. Another related idea is the so-called active learning paradigm (Lewis and Catlett, 1994; Zhang and Oles, 2000), which selectively annotates the most informative data (from the same domain) so as to reduce the total number of annotations required to achieve a certain level of accuracy. See (Tang et al., 2002; Steedman et al., 2003) for related studies in statistical natural language parsing.</Paragraph>
  </Section>
</Paper>