<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1045">
<Title>Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling</Title>
<Section position="8" start_page="368" end_page="369" type="relat">
<SectionTitle> 8 Related Work </SectionTitle>
<Paragraph position="0"> Several authors have successfully incorporated a label consistency constraint into probabilistic sequence models for named entity recognition.</Paragraph>
<Paragraph position="1"> Mikheev et al. (1999) and Finkel et al. (2004) incorporate label consistency information by using ad hoc multi-stage labeling procedures that are effective but special-purpose. Malouf (2002) and Curran and Clark (2003) condition the label of a token at a particular position on the label of the most recent previous instance of that same token in a prior sentence of the same document. Note that this violates the Markov property, but is achieved by slightly relaxing the requirement of exact inference. Instead of finding the maximum likelihood sequence over the entire document, they classify one sentence at a time, allowing them to condition on the maximum likelihood sequence of previous sentences. This approach is quite effective for enforcing label consistency in many NLP tasks; however, it permits only a forward flow of information, which is not sufficient for all cases of interest. Chieu and Ng (2002) propose a solution to this problem: for each token, they define additional features taken from other occurrences of the same token in the document.</Paragraph>
<Paragraph position="2"> This approach has the added advantage of allowing the training procedure to automatically learn good weightings for these &quot;global&quot; features relative to the local ones. However, this approach cannot easily be extended to incorporate other types of non-local structure.</Paragraph>
<Paragraph position="3"> The most relevant prior works are Bunescu and Mooney (2004), who use a Relational Markov Network (RMN) (Taskar et al., 2002) to explicitly model long-distance dependencies, and Sutton and McCallum (2004), who introduce skip-chain CRFs, which maintain the underlying CRF sequence model (which Bunescu and Mooney (2004) lack) while adding skip edges between distant nodes. Unfortunately, in the RMN model, the dependencies must be defined in the model structure before doing any inference, and so the authors use crude heuristic part-of-speech patterns to identify candidate entity spans, and then add dependencies between these text spans using clique templates.</Paragraph>
<Paragraph position="4"> This generates an extremely large number of overlapping candidate entities, which then necessitates additional templates to enforce the constraint that overlapping text subsequences cannot both be distinct entities, something that is more naturally modeled by a CRF.</Paragraph>
<Paragraph position="5"> Another disadvantage of this approach is that it uses loopy belief propagation and a voted perceptron for approximate learning and inference: ill-founded and inherently unstable algorithms that the authors note caused convergence problems. In the skip-chain CRF model, the decision of which nodes to connect is also made heuristically, and because the authors focus on named entity recognition, they chose to connect all pairs of identical capitalized words. They also utilize loopy belief propagation for approximate learning and inference.</Paragraph>
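<Paragraph> As a concrete illustration of this heuristic, the following is a minimal sketch (ours, not code from Sutton and McCallum (2004)) of how such skip edges could be collected before inference; the whitespace tokenization and the function name are assumptions made for illustration. </Paragraph>
from itertools import combinations

def skip_edges(tokens):
    """Collect skip edges between all pairs of identical capitalized
    words in one document, mirroring the heuristic described above.
    Returns the (i, j) position pairs that would receive a skip edge."""
    positions = {}
    for i, word in enumerate(tokens):
        if word[:1].isupper():  # capitalized-word heuristic
            positions.setdefault(word, []).append(i)
    return [edge for occurrences in positions.values()
            for edge in combinations(occurrences, 2)]

# "Speaker" occurs at positions 0 and 5, so a single skip edge results:
print(skip_edges("Speaker : Dan Roth . Speaker arrives at 2 pm".split()))
# -> [(0, 5)]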
<Paragraph position="6"> While the technique we propose is similar mathematically and in spirit to the above approaches, it differs in some important ways. Our approach is implemented by adding constraints to the model at inference time, and does not require the preprocessing step necessary in the two previously mentioned works. This allows for a broader class of long-distance dependencies, because we do not need to make any initial assumptions about which nodes should be connected, and it is helpful when one wishes to model relationships between nodes that share the same class but may not be similar in any other way.</Paragraph>
<Paragraph position="7"> For instance, in the CMU Seminar Announcements dataset, we can normalize all entities labeled as a start time and penalize the model if multiple, inconsistent times are labeled. This type of constraint cannot be modeled in an RMN or a skip-chain CRF, because it requires the knowledge that both entities are given the same class label.</Paragraph>
<Paragraph position="8"> We also allow dependencies between multi-word phrases, not just single words. Additionally, our model can be applied on top of a pre-existing trained sequence model. As such, our method does not require complex training procedures, and can instead leverage all of the established methods for training high-accuracy sequence models. It can indeed be used in conjunction with any statistical hidden state sequence model: HMMs, CMMs, CRFs, or even heuristic models. Finally, our technique employs Gibbs sampling for approximate inference, a simple and probabilistically well-founded algorithm. As a consequence of these differences, our approach is easier to understand, implement, and adapt to new applications.</Paragraph>
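<Paragraph> To make the inference-time approach concrete, the sketch below shows one way Gibbs sampling can impose a label-consistency preference on top of a pre-trained sequence model: each position is resampled from the model's local label distribution, reweighted by a penalty for disagreeing with the labels currently assigned to other occurrences of the same word. The model_probs hook, the penalty weight, and the sweep count are hypothetical stand-ins rather than the actual system, and a complete implementation would also need a strategy (such as annealing) for recovering the most likely sequence rather than a random sample. </Paragraph>
import math
import random

def gibbs_decode(tokens, init_labels, model_probs, penalty=2.0, sweeps=50):
    """Approximate inference with a document-level consistency penalty.
    model_probs(tokens, labels, i) is a hypothetical hook returning a
    {label: probability} dict from any trained hidden state sequence
    model (HMM, CMM, CRF); the model itself is never retrained."""
    labels = list(init_labels)
    for _ in range(sweeps):
        for i, word in enumerate(tokens):
            scores = {}
            for label, p in model_probs(tokens, labels, i).items():
                # count other occurrences of this word whose current
                # label disagrees with the candidate label
                clashes = sum(1 for j, w in enumerate(tokens)
                              if j != i and w == word and labels[j] != label)
                scores[label] = p * math.exp(-penalty * clashes)
            # sample the new label in proportion to its penalized score
            r = random.uniform(0.0, sum(scores.values()))
            for label, score in scores.items():
                r -= score
                if r <= 0.0:
                    labels[i] = label
                    break
    return labels
</Section>
</Paper>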