<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2003">
  <Title>Maximum Entropy Modeling in Sparse Semantic Tagging*</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Semantic analysis is an open research field in natural language processing. Two major research topics in this field are Named Entity Recognition (NER) (N. Wacholder and Choi, 1997; Cucerzan and Yarowsky, 1999) and Word Sense Disambiguation (WSD) (Yarowsky, 1995; Wilks and Stevenson, 1999). NER identifies different kinds of names such as &amp;quot;person&amp;quot;, &amp;quot;location&amp;quot;, or &amp;quot;date&amp;quot;, while WSD distinguishes the senses of ambiguous words. For example, &amp;quot;bank&amp;quot; can be labeled as &amp;quot;financial institution&amp;quot; or &amp;quot;edge of a river&amp;quot;. Our task of semantic analysis has a more general purpose: tagging all nouns with a single set of semantic labels. Compared with NER, which considers only names, our task concerns all nouns.</Paragraph>
    <Paragraph position="1"> Unlike WSD, in which every ambiguous word has its own sense set, our task uses a single set of semantic labels shared by all nouns. The motivation behind this work is that semantic category assignment with reliable performance can contribute to a number of applications, including statistical machine translation and sub-tasks of information extraction.</Paragraph>
    <Paragraph position="2"> * This work was supported in part by NSF grant number IIS0121285. Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.</Paragraph>
    <Paragraph position="3"> The semantic categories adopted in this paper come from the Longman Dictionary. These categories are neither parallel to nor independent of each other. One category may denote a concept which is a subset of that of another. Examples of the category structures are illustrated in Figure 1.</Paragraph>
    <Paragraph position="4">  The Maximum Entropy (MaxEnt) principle has been successfully applied in many classification and tagging tasks (Ratnaparkhi, 1996; K. Nigam and A. McCallum, 1999; A. McCallum and Pereira, 2000). We use MaxEnt modeling as the learning component. A major issue in MaxEnt training is how to select proper features and determine the feature targets (Berger et al., 1996; Jebara and Jaakkola, 2000). To discover useful features, we exploit the concept of Association Rules (AR) (R. Agrawal and Swami, 1993; Srikant and Agrawal, 1997), which was originally proposed in the data mining field to identify frequent itemsets in large databases.</Paragraph>
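As an illustration of the AR-style criterion, the sketch below (hypothetical code, not the authors' implementation; the instance format, thresholds, and function name are assumptions) keeps (context word, label) pairs whose support and confidence clear fixed thresholds, in the spirit of association-rule mining:

```python
from collections import Counter

def mine_feature_candidates(instances, min_support=2, min_confidence=0.5):
    """Keep (context_word, label) pairs whose support (joint count) and
    confidence (count / word frequency) exceed the given thresholds."""
    pair_counts = Counter()
    word_counts = Counter()
    for context_words, label in instances:
        for w in set(context_words):
            pair_counts[(w, label)] += 1
            word_counts[w] += 1
    return {
        (w, lab)
        for (w, lab), n in pair_counts.items()
        if n >= min_support and n / word_counts[w] >= min_confidence
    }

# Toy instances: (context words, semantic label)
instances = [
    (["river", "water"], "edge-of-river"),
    (["river", "boat"], "edge-of-river"),
    (["money", "loan"], "financial-institution"),
]
features = mine_feature_candidates(instances)
```

Here ("river", "edge-of-river") survives (support 2, confidence 1.0), while pairs seen only once are pruned.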
    <Paragraph position="5"> Like many other classification tasks, human-annotated data for semantic analysis is expensive and limited, while a large amount of unlabeled data is easily obtained. Many researchers (Blum and Mitchell, 1998; K. Nigam and Mitchell, 2000; Corduneanu and Jaakkola, 2002) have attempted to improve performance with unlabeled data. In this paper, we also propose a framework to bootstrap with unlabeled data. Fractional counts are assigned to unlabeled instances based on the current model and accessible knowledge sources. Pooled with the human-annotated data, the unlabeled data contributes to the next MaxEnt model.</Paragraph>
    <Paragraph position="6"> We begin with an introduction of our bootstrapping framework and MaxEnt training. An interesting MaxEnt puzzle is presented, with its derivation showing possible directions for utilizing unlabeled data. Then the feature selection criterion guided by AR is discussed, as well as indicator selection (Section 3). We discuss initialization methods for the unlabeled data. Strategies to guarantee convergence of the bootstrapping process and approaches to integrating non-statistical knowledge are proposed in Section 4. Finally, experimental results are presented along with conclusions and future work.</Paragraph>
    <Paragraph position="7"> 2 Bootstrapping and the MaxEnt puzzle
An instance in the corpus includes the headword, which is the noun to be labeled, and its context. To integrate the unlabeled instances in the training process, we propose a tagging method called soft tagging. Unlike normal tagging, in which each instance is assigned only one label, soft tagging allows one instance to be assigned several labels, each associated with a fractional credit. All credits assigned to one instance should sum to 1. For example, a raw instance with the headword &amp;quot;club&amp;quot; can be soft tagged as (movable J:0.3, not-movable N:0.3, collective U:0.4). Once all auxiliary instances have been assigned semantic labels, we pool them together with the human-annotated data to select useful features and set up feature expectations. Then a log-linear model is trained for several iterations to maximize the likelihood of the whole corpus. With the updated MaxEnt model, the unlabeled data is soft tagged again.</Paragraph>
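A minimal sketch of soft tagging (hypothetical code; the scoring function and the toy numbers are illustrative assumptions, not the authors' implementation): one unit of credit is distributed over the label set in proportion to the model's scores, so the credits always sum to 1.

```python
def soft_tag(instance, model, labels):
    """Distribute one unit of credit over the label set in proportion
    to the model's (unnormalized) scores for each label."""
    scores = {l: model(instance, l) for l in labels}
    total = sum(scores.values())
    return {l: s / total for l, s in scores.items()}

# Toy scoring function for the headword "club" (illustrative numbers only)
labels = ["movable J", "not-movable N", "collective U"]
toy_scores = {"movable J": 3.0, "not-movable N": 3.0, "collective U": 4.0}
credits = soft_tag("club", lambda x, l: toy_scores[l], labels)
# e.g. "collective U" receives 4/10 = 0.4 of the credit
```

With these toy scores the instance is soft tagged exactly as in the example above: (movable J:0.3, not-movable N:0.3, collective U:0.4).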
    <Paragraph position="8"> The whole process is repeated until the convergence condition is satisfied. This framework is illustrated in Figure 2.</Paragraph>
    <Paragraph position="9"> In building a MaxEnt model, unlabeled data contributes differently to feature selection and target estimation than human-annotated data does. While human-annotated instances never change, the tags in the unlabeled data are updated according to the new MaxEnt model in each bootstrapping iteration, which might lead to a different feature set.</Paragraph>
    <Paragraph position="10">  Before presenting the MaxEnt puzzle, let us first consider a regular MaxEnt formulation. Let $l$ be a label and $x$ an instance. Let $k_h(l,x)$ be the $h$-th binary feature function, which is equal to 1 if $(l,x)$ activates the $h$-th feature and 0 otherwise. $P(l,x)$ is given by the model, denoting the probability of observing both $l$ and $x$; $\hat{P}(l,x)$ is the empirical frequency of $(l,x)$ in the training data. The constraint associated with feature $k_h(l,x)$ is represented as:</Paragraph>
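The equation itself did not survive text extraction; in the standard MaxEnt formulation, and consistent with the definitions above, the constraint reads:

```latex
\sum_{x,l} \hat{P}(l,x)\, k_h(l,x) \;=\; \sum_{x,l} P(l,x)\, k_h(l,x)
```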
    <Paragraph position="12"> In practice, we are interested in conditional models $P(l \mid x)$, which assign probability mass to a label set given supporting evidence $x$, and we approximate $P(x)$ with its empirical distribution $\hat{P}(x)$:</Paragraph>
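The formula is missing from the extracted text; under the usual approximation $P(x) \approx \hat{P}(x)$, the conditional form of the constraint is:

```latex
\sum_{x} \hat{P}(x) \sum_{l} P(l \mid x)\, k_h(l,x)
\;=\;
\sum_{x,l} \hat{P}(l,x)\, k_h(l,x)
```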
    <Paragraph position="14"> Next, we show how the effect of unlabeled data disappears during bootstrapping in a restricted situation. We make the following assumptions:  1. The feature targets are the raw frequencies in the training data. No smoothing is applied.</Paragraph>
    <Paragraph position="15"> 2. During the bootstrapping, we maintain a fixed feature set, while the feature targets might change.</Paragraph>
    <Paragraph position="16"> 3. The fractional credit assigned to label $l$ and instance $x$ in the $(t+1)$-th iteration is $P_t(l \mid x)$.</Paragraph>
    <Paragraph position="17"> Let $N$ be the total number of instances $(1,\dots,N)$ in the labeled data and $M$ the total number of instances $(N+1,\dots,N+M)$ in the unlabeled data. Let $\lambda$ be the weight assigned to unlabeled instances. The new constraint for the $h$-th feature function can be rewritten as:</Paragraph>
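Equation 3 did not survive extraction; reconstructed from the surrounding derivation (the left side holds the targets from gold labels and soft tags under $P_t$, the right side the model expectations under $P_{t+1}$, so that the second terms on both sides cancel at convergence), it should read:

```latex
\frac{1}{N+\lambda M}\Bigl[\,\sum_{i=1}^{N} k_h(l_i,x_i)
  + \lambda \sum_{i=N+1}^{N+M}\sum_{l} P_t(l \mid x_i)\, k_h(l,x_i)\Bigr]
\;=\;
\frac{1}{N+\lambda M}\Bigl[\,\sum_{i=1}^{N}\sum_{l} P_{t+1}(l \mid x_i)\, k_h(l,x_i)
  + \lambda \sum_{i=N+1}^{N+M}\sum_{l} P_{t+1}(l \mid x_i)\, k_h(l,x_i)\Bigr]
\tag{3}
```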
    <Paragraph position="19"> where $t$ is the index of the bootstrap iterations and $l_i$ is the human-annotated label attached to the $i$-th instance.</Paragraph>
    <Paragraph position="20"> When the procedure converges, $P_t(l \mid x_i) = P_{t+1}(l \mid x_i)$, and the second terms on both sides cancel each other. Finally, the constraint turns out to be:</Paragraph>
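Equation 4 is also missing from the extracted text; keeping only the labeled-data terms from the constraint above, it should read:

```latex
\sum_{i=1}^{N} k_h(l_i,x_i) \;=\; \sum_{i=1}^{N}\sum_{l} P(l \mid x_i)\, k_h(l,x_i)
\tag{4}
```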
    <Paragraph position="22"> which includes statistics only from the labeled data. If the feature set stays the same, the unlabeled data makes no contribution to the whole constraint set: a model which satisfies Equation 3 must also satisfy Equation 4. If the unlabeled instances follow the same distribution as the labeled instances, then given the same set of constraints, the model trained on only labeled data is equivalent to the one trained on both.</Paragraph>
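The cancellation at the heart of the puzzle can be checked numerically. The sketch below (illustrative code with toy numbers, not the authors' implementation) evaluates the unlabeled contribution on both sides of the constraint with a converged model, where $P_t = P_{t+1}$, and confirms it cancels exactly:

```python
# Toy check of the "MaxEnt puzzle": at convergence (P_t == P_{t+1}),
# the unlabeled terms on both sides of the constraint are identical,
# so the constraint reduces to the labeled-data-only one.
labels = ["A", "B"]
unlabeled = ["x3", "x4"]
lam = 0.5                    # weight on unlabeled instances

def k(l, x):                 # a single toy binary feature function
    return 1.0 if l == "A" else 0.0

def P_t(l, x):               # converged model's conditional distribution
    return 0.7 if l == "A" else 0.3

P_t1 = P_t                   # P_{t+1} == P_t at convergence

# Unlabeled contribution to the target side (soft tags under P_t) ...
lhs = lam * sum(P_t(l, x) * k(l, x) for x in unlabeled for l in labels)
# ... and to the expectation side (model P_{t+1})
rhs = lam * sum(P_t1(l, x) * k(l, x) for x in unlabeled for l in labels)

residual = lhs - rhs         # 0.0: unlabeled data drops out of the constraint
```

Because both sides use the same converged distribution, the residual is exactly zero regardless of the toy numbers chosen.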
    <Paragraph position="23"> The above derivation shows that under these three restrictions, the soft tagged instances make no contribution to the constraint set, which is counter-intuitive. That is why we call it a &amp;quot;MaxEnt Puzzle&amp;quot;. Consequently, to utilize unlabeled data, we should break the restrictions: reselect the feature set after each bootstrapping iteration, or adjust the soft tagging results given by the model.</Paragraph>
  </Section>
</Paper>