<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1015"> <Title>A Unified Statistical Model for the Identification of English BaseNP</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Identifying simple, non-recursive base noun phrases (baseNPs) is an important subtask for many natural language processing applications, such as partial parsing, information retrieval and machine translation. A baseNP is a simple noun phrase that does not contain other noun phrases recursively. For example, the elements within [...] in the following example are baseNPs, where NNS, IN, VBG, etc. are part-of-speech tags [as defined in M. Marcus 1993].</Paragraph> <Paragraph position="2"> A number of researchers have dealt with the problem of baseNP identification (Church 1988; Bourigault 1992; Voutilainen 1993; Justeson & Katz 1995). Recently, several researchers have run experiments on the same test corpus, extracted from the 20th section of the Penn Treebank Wall Street Journal (Penn Treebank). Ramshaw & Marcus (1998) applied the transformation-based error-driven learning algorithm (Brill 1995) to learn a set of transformation rules, and used those rules to locally update the bracket positions. Argamon, Dagan & Krymolowski (1998) introduced a memory-based sequence learning method in which the training examples are stored and generalization is performed at application time by comparing subsequences of the new text to positive and negative evidence. Cardie & Pierce (1998, 1999) devised an error-driven pruning approach trained on the Penn Treebank. It extracts baseNP rules from the training corpus, prunes some bad baseNP rules by incremental training, and then applies the pruned rules to identify baseNPs through maximum-length matching (or a dynamic programming algorithm).</Paragraph> <Paragraph position="3"> Most of the prior work treats POS tagging and baseNP identification as two separate procedures. 
However, uncertainty is involved in both steps. Treating the result of the first step as if it were certain leads to more errors in the second step. A better approach is to consider the two steps together, so that the final output takes the uncertainty in both steps into account. The approaches proposed by Ramshaw & Marcus and Cardie & Pierce are deterministic and local, while Argamon, Dagan & Krymolowski consider the problem globally and assign a score to each possible baseNP structure.</Paragraph> <Paragraph position="4"> However, they did not consider any lexical information.</Paragraph> <Paragraph position="5"> This paper presents a novel statistical approach to baseNP identification, which considers both steps together within a unified statistical framework. It also takes lexical information into account. In addition, in order to make the best choice for the entire sentence, the Viterbi algorithm is applied. Our tests with the Penn Treebank showed that our integrated approach achieves 92.3% precision and 93.2% recall. The result is comparable to or better than the current state of the art.</Paragraph> <Paragraph position="6"> The rest of the paper is organized as follows. Section 2 describes the details of the algorithm, parameter estimation and the search algorithm. The experimental results are given in section 3. In section 4 we make further analysis and comparison. In the final section we give some conclusions.</Paragraph> </Section> </Paper>