<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0721">
  <Title>Shallow Parsing by Inferencing with Classifiers</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Identifying Phrase Structure
</SectionTitle>
    <Paragraph position="0"> The problem of identifying phrase structure can be formalized as follows. Given an input string O =&lt; ol, 02,..., On &gt;, a phrase is a substring of consecutive input symbols oi, oi+l,...,oj.</Paragraph>
    <Paragraph position="1"> Some external mechanism is assumed to consistently (or stochastically) annotate substrings as phrases 2. Our goal is to come up with a mechanism that, given an input string, identifies the phrases in this string, this is a fundamental task with applications in natural language (Church, 1988; Ramshaw and Marcus, 1995; Mufioz et al., 1999; Cardie and Pierce, 1998).</Paragraph>
    <Paragraph position="2"> The identification mechanism works by using classifiers that process the input string and recognize in the input string local signals which  each input symbol is either in a phrase or outside it. All the methods we discuss can be extended to deal with several kinds of phrases in a string, including different kinds of phrases and embedded phrases.</Paragraph>
    <Paragraph position="3"> are indicative to the existence of a phrase. Local signals can indicate that an input symbol o is inside or outside a phrase (IO modeling) or they can indicate that an input symbol o opens or closes a phrase (the OC modeling) or some combination of the two. In any case, the local signals can be combined to determine the phrases in the input string. This process, however, needs to satisfy some constraints for the resulting set of phrases to be legitimate. Several types of constraints, such as length and order can be formalized and incorporated into the mechanisms studied here. For simplicity, we focus only on the most basic and common constraint - we assume that phrases do not overlap.</Paragraph>
    <Paragraph position="4"> The goal is thus two-fold: to learn classifiers that recognize the local signals and to combine these in a ways that respects the constraints.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="107" type="metho">
    <SectionTitle>
2 Markov Modeling
</SectionTitle>
    <Paragraph position="0"> HMM is a probabilistic finite state automaton used to model the probabilistic generation of sequential processes. The model consists of a finite set S of states, a set (9 of observations, an initial state distribution P1 (s), a state-transition distribution P(s\[s') for s, # E S and an observation distribution P(o\[s) for o E (9 and s 6 S. 3 In a supervised learning task, an observation sequence O --&lt; ol,o2,... On &gt; is supervised by a corresponding state sequence S =&lt; sl, s2,. * * sn &gt;. The supervision can also be supplied, as described in Sec. 1, using the local signals. Constraints can be incorporated into the HMM by constraining the state transition probability distribution P(s\]s'). For example, set P(sV) = 0 for all s, s' such that the transition from s ~ to s is not allowed.</Paragraph>
    <Paragraph position="1"> aSee (Rabiner, 1989) for a comprehensive tutorial.  Combining HMM and classifiers (artificial neural networks) has been exploited in speech recognition (Morgan and Bourlard, 1995), however, with some differences from this work.</Paragraph>
    <Section position="1" start_page="107" end_page="107" type="sub_section">
      <SectionTitle>
2.1 HMM with Classifiers
</SectionTitle>
      <Paragraph position="0"> To recover the most likely state sequence in HMM, we wish to estimate all the required probability distributions. As in Sec. 1 we assume to have local signals that indicate the state. That is, we are given classifiers with states as their outcomes. Formally, we assume that Pt(slo ) is given where t is the time step in the sequence. In order to use this information in the HMM framework, we compute</Paragraph>
      <Paragraph position="2"> instead of observing the conditional probability Pt (ols) directly from training data, we compute it from the classifiers' output. Pt(s) can be cal-</Paragraph>
      <Paragraph position="4"> bution for the HMM. For each t, we can treat Pt(ols ) in Eq. 1 as a constant r/t because our goal is only to find the most likely sequence of states for given observations which are the same for all compared sequences. Therefore, to compute the most likely sequence, standard dynamic programming (Viterbi) can still be applied.</Paragraph>
    </Section>
    <Section position="2" start_page="107" end_page="107" type="sub_section">
      <SectionTitle>
2.2 Projection based Markov Model
</SectionTitle>
      <Paragraph position="0"> In HMMs, observations are allowed to depend only on the current state and long term dependencies are not modeled. Equivalently, from the constraint point of view, the constraint structure is restricted by having a stationary probability distribution of a state given the previous one. We attempt to relax this by allowing the distribution of a state to depend, in addition to the previous state, on the observation. Formally, we make the independence assumption:</Paragraph>
      <Paragraph position="2"> Thus, we can find the most likely state sequence</Paragraph>
      <Paragraph position="4"> Hence, this model generalizes the standard HMM by combining the state-transition probability and the observation probability into one function. The most likely state sequence can still be recovered using the dynamic programming algorithm over the Eq.3.</Paragraph>
      <Paragraph position="5"> In this model, the classifiers' decisions are incorporated in the terms P(sls',o ) and Pl(slo ). In learning these classifiers we project P(sls ~, o) to many functions Ps' (slo) according to the previous states s ~. A similar approach has been developed recently in the context of maximum entropy classifiers in (McCallum et al., 2000).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="107" end_page="108" type="metho">
    <SectionTitle>
3 Constraint Satisfaction with
Classifiers
</SectionTitle>
    <Paragraph position="0"> The approach is based on an extension of the Boolean constraint satisfaction formalism (Mackworth, 1992) to handle variables that are outcomes of classifiers. As before, we assume an observed string 0 =&lt; ol,o2,... On &gt; and local classifiers that, w.l.o.g., take two distinct values, one indicating the openning a phrase and a second indicating closing it (OC modeling). The classifiers provide their outputs in terms of the probability P(o) and P(c), given the observation.</Paragraph>
    <Paragraph position="1"> To formalize this, let E be the set of all possible phrases. All the non-overlapping constraints can be encoded in: f --/ke~ overlaps ej (-~eiV-~ej). Each solution to this formulae corresponds to a legitimate set of phrases.</Paragraph>
    <Paragraph position="2"> Our problem, however, is not simply to find an assignment : E -+ {0, 1} that satisfies f but rather to optimize some criterion. Hence, we associate a cost function c : E ~ \[0,1\] with each variable, and then find a solution ~of f of minimum cost, c(~-) = n Ei=l In phrase identification, the solution to the optimization problem corresponds to a shortest path in a directed acyclic graph constructed on the observation symbols, with legitimate phrases (the variables in E) as its edges and their costs as the weights. Each path in this graph corresponds to a satisfying assignment and the shortest path is the optimal solution.</Paragraph>
    <Paragraph position="3"> A natural cost function is to use the classifiers probabilities P(o) and P(c) and define, for a phrase e = (o, c), c(e) = 1 - P(o)P(c) which means that the error in selecting e is the error in selecting either o or c, and allowing those  to overlap 4. The constant in 1 - P(o)P(c) biases the minimization to prefers selecting a few phrases, possibly no phrase, so instead we minimize -P(o) P(c).</Paragraph>
  </Section>
  <Section position="5" start_page="108" end_page="108" type="metho">
    <SectionTitle>
4 Shallow Parsing
</SectionTitle>
    <Paragraph position="0"> The above mentioned approaches are evaluated on shallow parsing tasks, We use the OC modeling and learn two classifiers; one predicting whether there should be a open in location t or not, and the other whether there should a close in location t or not. For technical reasons it is easier to keep track of the constraints if the cases --1 o and --1 c are separated according to whether we are inside or outside a phrase. Consequently, each classifier may output three possible outcomes O, nOi, nOo (open, not open inside, not open outside) and C, nCi, nCo, resp. The state-transition diagram in figure 1 captures the order constraints. Our modeling of the problem is a modification of our earlier work on this topic that has been found to be quite successful compared to other learning methods attempted on this problem (Mufioz et al., 1999) and in particular, better than the IO modeling of the problem (Mufioz et al., 1999).</Paragraph>
    <Paragraph position="1">  phrase recognition problem.</Paragraph>
    <Paragraph position="2"> The classifier we use to learn the states as a function of the observations is SNoW (Roth, 1998; Carleson et al., 1999), a multi-class classifter that is specifically tailored for large scale learning tasks. The SNoW learning architecture learns a sparse network of linear functions, in which the targets (states, in this case) are represented as linear functions over a common feature space. Typically, SNoW is used as a classifier, and predicts using a winner-take-all 4Another solution in which the classifiers' suggestions inside each phrase axe also accounted for is possible. mechanism over the activation value of the taxget classes in this case. The activation value itself is computed using a sigmoid function over the linear sum. In this case, instead, we normalize the activation levels of all targets to sum to 1 and output the outcomes for all targets (states).</Paragraph>
    <Paragraph position="3"> We verified experimentally on the training data that the output for each state is indeed a distribution function and can be used in further processing as P(slo ) (details omitted).</Paragraph>
  </Section>
  <Section position="6" start_page="108" end_page="109" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We experimented both with base noun phrases (NP) and subject-verb patterns (SV) and show results for two different representations of the observations (that is, different feature sets for the classifiers) - part of speech (POS) tags only and POS with additional lexical information (words). The data sets used are the standard data sets for this problem (Ramshaw and Maxcus, 1995; Argamon et al., 1999; Mufioz et al., 1999; Tjong Kim Sang and Veenstra, 1999) taken from the Wall Street Journal corpus in the Penn Treebank (Marcus et al., 1993).</Paragraph>
    <Paragraph position="1"> For each model we study three different classifiers. The simple classifier corresponds to the standard HMM in which P(ols ) is estimated directly from the data. The NB (naive Bayes) and SNoW classifiers use the same feature set, conjunctions of size 3 of POS tags (+ words) in a window of size 6 around the target word.</Paragraph>
    <Paragraph position="2"> The first important observation is that the SV task is significantly more difficult than the NP task. This is consistent for all models and all features sets. When comparing between different models and features sets, it is clear that the simple HMM formalism is not competitive with the other two models. What is interesting here is the very significant sensitivity to the wider notion of observations (features) used by the classifiers, despite the violation of the probabilistic assumptions. For the easier NP task, the HMM model is competitive with the others when the classifiers used are NB or SNOW.</Paragraph>
    <Paragraph position="3"> In particular, a significant improvement in both probabilistic methods is achieved when their input is given by SNOW.</Paragraph>
    <Paragraph position="4"> Our two main methods, PMM and CSCL, perform very well on predicting NP and SV phrases with CSCL at least as good as any other methods tried on these tasks. Both for NPs and  and comparison to previous works on NP and SV recognition. Notice that, in case of simple, the data with lexical features are too sparse to directly estimate the observation probability so we leave these entries empty.</Paragraph>
  </Section>
  <Section position="7" start_page="109" end_page="109" type="metho">
    <SectionTitle>
SV
</SectionTitle>
    <Paragraph position="0"> SVs, CSCL performs better than the probabilistic method, more significantly on the harder, SV, task. We attribute it to CSCL's ability to cope better with the length of the phrase and the long term dependencies.</Paragraph>
    <Paragraph position="1"> Our methods compare favorably with others with the exception to SV in (Mufioz et al., 1999). Their method is fundamentally similar to our CSCL; however, they incorporated the features from open in the close classifier allowing to exploit the dependencies between two classifiers. We believe that this is the main factor of the significant difference in performance.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML