<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2004"> <Title>Jointly Labeling Multiple Sequences: A Factorial HMM Approach</Title> <Section position="3" start_page="19" end_page="20" type="metho"> <SectionTitle> 2 Factorial HMM </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 2.1 Basic Factorial HMM </SectionTitle> <Paragraph position="0"> A Factorial Hidden Markov Model (FHMM) is a hidden Markov model with a distributed state representation. Let x_{1:T} be a length-T sequence of observed random variables (e.g. words) and y_{1:T} and z_{1:T} be the corresponding sequences of hidden state variables (e.g. tags, chunks). Then we define the FHMM as the probabilistic model:</Paragraph> <Paragraph position="1"> p(x_{1:T}, y_{1:T}, z_{1:T}) = π_0 ∏_{t=1}^{T} p(x_t|y_t, z_t) p(y_t|y_{t-1}, z_t) p(z_t|z_{t-1})   (1)</Paragraph> <Paragraph position="2"> where π_0 = p(x_0|y_0, z_0) p(y_0|z_0) p(z_0). Viewed as a generative process, we can say that the chunk model p(z_t|z_{t-1}) generates chunks depending on the previous chunk label, the tag model p(y_t|y_{t-1}, z_t) generates tags based on the previous tag and current chunk, and the word model p(x_t|y_t, z_t) generates words using the tag and chunk at the same time-step.</Paragraph> <Paragraph position="3"> This equation corresponds to the graphical model of Figure 1. Although the original FHMM developed by Ghahramani (1997) does not explicitly model the dependencies between the two hidden state sequences, here we add the edges between the y and z nodes to reflect the interaction between tag and chunk sequences. Note that the FHMM can be collapsed into a hidden Markov model where the hidden state is the cross-product of the distributed states y and z. Despite this equivalence, the FHMM is advantageous because it requires the estimation of substantially fewer parameters.</Paragraph> <Paragraph position="4"> FHMM parameters can be calculated via maximum likelihood (ML) estimation if the values of the hidden states are available in the training data. Otherwise, parameters must be learned using approximate inference algorithms (e.g. Gibbs sampling, variational inference), since the exact Expectation-Maximization (EM) algorithm is computationally intractable (Ghahramani and Jordan, 1997). Given a test sentence, the corresponding tag/chunk sequence is inferred with the Viterbi algorithm, which finds the tag/chunk sequence that maximizes the joint probability, i.e.</Paragraph> <Paragraph position="5"> arg max_{y_{1:T}, z_{1:T}} p(x_{1:T}, y_{1:T}, z_{1:T})   (2)</Paragraph> <Paragraph position="7"/> </Section> <Section position="2" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 2.2 Adding Cross-Sequence Dependencies </SectionTitle> <Paragraph position="0"> Many other structures exist in the FHMM framework. Statistical modeling often involves the iterative process of finding the best set of dependencies that characterizes the data effectively. As shown in Figures 2(a), 2(b), and 2(c), dependencies can be added between y_t and z_{t-1}, between z_t and y_{t-1}, or both. The model in Fig. 2(a) corresponds to changing the tag model in Eq. 1 to p(y_t|y_{t-1}, z_t, z_{t-1}); Fig. 2(b) corresponds to changing the chunk model to p(z_t|z_{t-1}, y_{t-1}); Fig. 2(c) corresponds to changing both tag and chunk models, leading to the probability model:</Paragraph> <Paragraph position="1"> p(x_{1:T}, y_{1:T}, z_{1:T}) = π_0 ∏_{t=1}^{T} p(x_t|y_t, z_t) p(y_t|y_{t-1}, z_t, z_{t-1}) p(z_t|z_{t-1}, y_{t-1})   (3)</Paragraph> <Paragraph position="2"> We name the models in Figs. 2(a) and 2(b) FHMM-T and FHMM-C, after the dependencies added to the tag and chunk models, respectively. The model of Fig. 2(c) and Eq. 3 will be referred to as FHMM-CT. Intuitively, the added dependencies will improve the predictive power across chunk and tag sequences, provided that enough training data are available for robust parameter estimation.</Paragraph> </Section> </Section>
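The following sketch (not code from the paper) illustrates the factorization in Eq. 1: it accumulates the log-probability of a word/tag/chunk triple of sequences from three conditional probability tables. The table layout, function name, and the log-space initial term are assumptions made for this illustration only.

```python
import math

def fhmm_log_joint(words, tags, chunks, p_word, p_tag, p_chunk, log_pi0):
    """Log of Eq. 1: pi_0 * prod_t p(x_t|y_t,z_t) p(y_t|y_{t-1},z_t) p(z_t|z_{t-1}).

    Assumed (toy) table layout:
      p_word[(y, z)][x]      ~ p(x_t | y_t, z_t)       word model
      p_tag[(y_prev, z)][y]  ~ p(y_t | y_{t-1}, z_t)   tag model
      p_chunk[z_prev][z]     ~ p(z_t | z_{t-1})        chunk model
      log_pi0                ~ log of the initial term pi_0 for time-step 0
    """
    logp = log_pi0
    for t in range(1, len(words)):
        x, y, z = words[t], tags[t], chunks[t]
        y_prev, z_prev = tags[t - 1], chunks[t - 1]
        logp += math.log(p_chunk[z_prev][z])      # chunk model
        logp += math.log(p_tag[(y_prev, z)][y])   # tag model
        logp += math.log(p_word[(y, z)][x])       # word model
    return logp
```

Viterbi decoding (Eq. 2) maximizes this same quantity over all hidden sequences, which can be done exactly by running the standard Viterbi recursion over the cross-product state (y_t, z_t).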
<Section position="4" start_page="20" end_page="21" type="metho"> <SectionTitle> 3 Switching Factorial HMM </SectionTitle> <Paragraph position="0"> A reasonable question to ask is, &quot;How exactly does the chunk sequence interact with the tag sequence?&quot; The approach of adding dependencies in Section 2.2 acknowledges the existence of cross-sequence interactions but does not explicitly specify the type of interaction. It relies on statistical learning to find the salient dependencies, but such an approach is feasible only when sufficient data are available for parameter estimation.</Paragraph> <Paragraph position="1"> To answer the question, we consider how the chunk sequence affects the generative process for tags: First, we can expect that the unigram distribution of tags changes depending on whether the chunk is a noun phrase or a verb phrase. (In a noun phrase, noun and adjective tags are more common; in a verb phrase, verb and adverb tags are more frequent.) Similarly, a bigram distribution p(y_t|y_{t-1}) describing tag transition probabilities differs depending on the bigram's location in the chunk sequence, such as whether it is within a noun phrase, within a verb phrase, or at a phrase boundary. In other words, the chunk sequence interacts with tags by switching the particular generative process for tags. We model this interaction explicitly using a Switching FHMM:</Paragraph> <Paragraph position="2"> p(x_{1:T}, y_{1:T}, z_{1:T}) = π_0 ∏_{t=1}^{T} p(x_t|y_t, z_t) p_α(y_t|y_{t-1}) p_β(z_t|z_{t-1}), where α = f(z_{1:t}) and β = g(y_{1:t})</Paragraph> <Paragraph position="3"> In this new model, the chunk and tag are now generated by bigram distributions parameterized by α and β. For different values of α (or β), we have different distributions for p(y_t|y_{t-1}) (or p(z_t|z_{t-1})). The crucial aspect of the model lies in the function α = f(z_{1:t}), which summarizes the information in z_{1:t} that is relevant for the generation of y, and the function β = g(y_{1:t}), which captures the information in y_{1:t} that is relevant to the generation of z.</Paragraph> <Paragraph position="4"> In general, the functions f(·) and g(·) partition the space of all tag or chunk sequences into several equivalence classes, such that all instances of an equivalence class give rise to the same generative model for the cross sequence. For instance, all consecutive chunk labels that indicate a noun phrase can be mapped to one equivalence class, while labels that indicate a verb phrase can be mapped to another.</Paragraph> <Paragraph position="5"> The mapping can be specified manually or learned automatically. Section 5 discusses a linguistically-motivated mapping that is used for the experiments. Once the mappings are defined, the parameters p_α(y_t|y_{t-1}) and p_β(z_t|z_{t-1}) are obtained via maximum likelihood estimation in a fashion similar to that of the FHMM. The only exception is that now the training data are partitioned according to the mappings, and each α- and β-specific generative model is estimated separately. Inference of the tags and chunks for a test sentence proceeds similarly to FHMM inference. We call this model a Switching FHMM since the distribution of a hidden sequence &quot;switches&quot; dynamically depending on the values of the other hidden sequence.</Paragraph> <Paragraph position="6"> An idea related to the Switching FHMM is the Bayesian Multinet (Geiger and Heckerman, 1996; Bilmes, 2000), which allows the dynamic switching of conditional variables. It can be used to implement switching from a higher-order model to a lower-order model, a form of backoff smoothing for dealing with data sparsity. The Switching FHMM differs in that it switches among models of the same order, but these models represent different generative processes. The result is that the model no longer requires a time-homogeneous assumption for state transitions; rather, the transition probabilities change dynamically depending on the influence across sequences.</Paragraph> </Section>
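As a concrete illustration of the switching mechanism, here is a toy sketch (not the paper's implementation): the mapping f below conditions only on the current chunk label, whereas the linguistically-motivated mapping used in the experiments is defined in Section 5, and the probability values are invented for the example.

```python
# Illustrative sketch of the Switching FHMM tag model p_alpha(y_t | y_{t-1});
# the mapping and probabilities are assumptions, not taken from the paper.

def f(chunk_history):
    """Map the chunk history z_{1:t} to an equivalence class alpha.

    Toy rule: distinguish positions inside a base NP from positions outside.
    """
    return "in_np" if chunk_history[-1] in ("B", "I") else "out_np"

# One tag-bigram table per equivalence class (toy numbers; a real model
# estimates these by maximum likelihood on the partitioned training data).
p_tag_by_alpha = {
    "in_np":  {"DT": {"NN": 0.6, "JJ": 0.4}, "JJ": {"NN": 1.0}},
    "out_np": {"MD": {"VB": 0.7, "RB": 0.3}, "VB": {"RB": 1.0}},
}

def switching_tag_prob(y_prev, y, chunk_history, floor=1e-6):
    """p_alpha(y_t | y_{t-1}) with alpha = f(z_{1:t}); floor for unseen bigrams."""
    alpha = f(chunk_history)
    return p_tag_by_alpha[alpha].get(y_prev, {}).get(y, floor)

# Example: NN following DT is likely inside an NP, but the same bigram falls
# back to the floor value under the "out_np" generative process.
print(switching_tag_prob("DT", "NN", ["B", "I"]))   # 0.6
print(switching_tag_prob("DT", "NN", ["O"]))        # 1e-06
```

Maintaining one table per equivalence class corresponds to partitioning the training data by α (and by β for the chunk model) and running maximum likelihood estimation separately on each partition, as described above.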
<Section position="5" start_page="21" end_page="21" type="metho"> <SectionTitle> 4 POS Tagging and NP Chunking </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.1 The Tasks </SectionTitle> <Paragraph position="0"> POS tagging is the task of assigning words the correct part-of-speech, and is often the first stage of various natural language processing tasks. As a result, POS tagging has been one of the most active areas of research, and many statistical and rule-based approaches have been tried. The most notable of these include the trigram HMM tagger (Brants, 2000), the maximum entropy tagger (Ratnaparkhi, 1996), the transformation-based tagger (Brill, 1995), and cyclic dependency networks (Toutanova et al., 2003).</Paragraph> <Paragraph position="1"> Accuracy numbers for POS tagging are often reported in the range of 95% to 97%. Although this may seem high, note that a tagger with 97% accuracy has only a 63% chance of getting all tags in a 15-word sentence correct, whereas a 98% accurate tagger has a 74% chance (Manning and Schütze, 1999). Therefore, small improvements can be significant, especially if downstream processing requires correctly-tagged sentences. One of the most difficult problems in POS tagging is the handling of out-of-vocabulary words.</Paragraph> <Paragraph position="2"> Noun-phrase (NP) chunking is the task of finding the non-recursive (base) noun phrases of sentences.</Paragraph> <Paragraph position="3"> This segmentation task can be achieved by assigning each word in a sentence one of three tokens: B for &quot;Begin-NP&quot;, I for &quot;Inside-NP&quot;, or O for &quot;Outside-NP&quot; (Ramshaw and Marcus, 1995). The &quot;Begin-NP&quot; token is used in the case when an NP chunk is immediately followed by another NP chunk. State-of-the-art chunkers report F1 scores of 93%-94% and accuracies of 87%-97%. See, for example, the NP chunkers utilizing conditional random fields (Sha and Pereira, 2003) and support vector machines (Kudo and Matsumoto, 2001).</Paragraph> </Section> <Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.2 Data </SectionTitle> <Paragraph position="0"> The data comes from the CoNLL 2000 shared task (Sang and Buchholz, 2000), which consists of sentences from the Penn Treebank Wall Street Journal corpus (Marcus et al., 1993). The training set contains a total of 8936 sentences with a vocabulary of 19k unique words. The test set contains 2012 sentences and an 8k vocabulary. The out-of-vocabulary rate is 7%.</Paragraph> <Paragraph position="1"> There are 45 different POS tags and 3 different NP labels in the original data. An example sentence with POS and NP tags is shown in Table 1.</Paragraph> <Paragraph position="2"> [Table 1: the example sentence &quot;The move could pose a challenge&quot; annotated with its POS tags and NP labels]</Paragraph> </Section> </Section> </Paper>
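A quick worked check of the sentence-level accuracy figures quoted in Section 4.1 (an illustrative calculation appended here, not part of the paper): if a tagger is correct on each token independently with probability p, it tags an entire n-word sentence correctly with probability p^n.

```python
# Sentence-level accuracy implied by per-token accuracy, assuming independent
# per-token errors (illustrative only).
for per_token in (0.97, 0.98):
    whole_sentence = per_token ** 15  # probability all 15 tokens are correct
    print(f"{per_token:.0%} per token -> {whole_sentence:.0%} per 15-word sentence")
# 97% per token -> 63% per 15-word sentence
# 98% per token -> 74% per 15-word sentence
```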