<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1015"> <Title>A Unified Statistical Model for the Identification of English BaseNP</Title> <Section position="3" start_page="0" end_page="121" type="metho"> <SectionTitle> 2 The statistical approach </SectionTitle> <Paragraph position="0"> In this section, we describe the two-pass statistical model, the parameter training, and the Viterbi algorithm used to search for the best sequences of POS tags and baseNP labels. Before describing the algorithm, we introduce some notation.</Paragraph> <Section position="1" start_page="0" end_page="121" type="sub_section"> <SectionTitle> 2.1 Notation </SectionTitle> <Paragraph position="0"> Let us express an input sentence E as a word sequence and a sequence of POS tags, respectively, as follows:</Paragraph> <Paragraph position="2"> E = w_1 w_2 ... w_n </Paragraph> <Paragraph position="4"> T = t_1 t_2 ... t_n </Paragraph> <Paragraph position="6"> Given E, the result of baseNP identification is assumed to be a sequence in which some words are grouped into baseNPs, e.g. ... w_{i-1} [w_i w_{i+1} ... w_j] w_{j+1} ...; the POS tag string of a bracketed group can be thought of as a baseNP rule. Therefore B is a sequence of both POS tags and baseNP rules:</Paragraph> <Paragraph position="8"> B = n_1 n_2 ... n_m, m ≤ n, where each n_i is drawn from the union of the POS tag set and the baseNP rule set. This is the first expression of a sentence with baseNPs annotated. Sometimes we also use the following equivalent form:</Paragraph> <Paragraph position="10"> Q = (w_1, bm_1) (w_2, bm_2) ... (w_n, bm_n), where bm_i = (t_i, p_i) combines the POS tag t_i of word w_i with its positional information p_i with respect to baseNPs. The positional information is one of {F, I, E, O, S}. F, E and I mean respectively that the word is the left boundary of a baseNP, the right boundary of a baseNP, or at another position inside a baseNP. O means that the word is outside any baseNP. S marks a single-word baseNP. 
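To make the {F, I, E, O, S} positional tags concrete, here is a small sketch (my own illustration, not code from the paper) that converts a bracket-annotated sentence into the second expression; the bracketing of the sample sentence is only illustrative:

```python
# Illustrative sketch: convert a bracketed baseNP annotation into the
# positional tags {F, I, E, O, S} described above.
# F = first word of a multi-word baseNP, E = last word, I = other words
# inside, O = outside any baseNP, S = a single-word baseNP.
def positional_tags(tokens):
    """tokens: a list of words interleaved with bracket marks '[' / ']'."""
    tags, inside, buf = [], False, []
    for tok in tokens:
        if tok == '[':
            inside, buf = True, []
        elif tok == ']':
            if len(buf) == 1:
                tags.append('S')
            else:
                tags.extend(['F'] + ['I'] * (len(buf) - 2) + ['E'])
            inside = False
        elif inside:
            buf.append(tok)
        else:
            tags.append('O')
    return tags

print(positional_tags(['[', 'stock', ']', 'was', 'down',
                       '[', '9.1', 'points', ']', 'yesterday', 'morning', '.']))
# ['S', 'O', 'O', 'F', 'E', 'O', 'O', 'O']
```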
This second expression is similar to that used in [Marcus 1995].</Paragraph> <Paragraph position="11"> For example, the two expressions of the example given in Figure 1 are as follows. We separate the whole procedure into two passes:</Paragraph> <Paragraph position="12"> Pass 1 determines the POS sequences T that maximize P(T|E) (Equation (2)); Pass 2 determines B* = argmax_B P(B|T,E) P(T|E) over those sequences (Equation (3)).</Paragraph> <Paragraph position="13"> In order to reduce the search space and the computational complexity, we consider only the N best POS taggings of E.</Paragraph> <Paragraph position="14"> Correspondingly, the algorithm is composed of two steps: determining the N-best POS taggings using Equation (2), and then determining the best baseNP sequence from those POS sequences using Equation (3). One can see that the two steps are integrated, rather than separated as in other approaches. Let us now examine the two steps more closely.</Paragraph> </Section> <Section position="2" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 2.3 Determining the N best POS sequences </SectionTitle> <Paragraph position="0"> The goal of the algorithm in the 1st pass is to search for the N best POS sequences within the search space (POS lattice). According to Bayes' rule, P(T|E) = P(E|T)P(T)/P(E); since P(E) is constant over candidate taggings, we maximize P(E|T)P(T), approximated as the product over i of P(w_i | t_i) P(t_i | t_{i-2}, t_{i-1}), where P(t_i | t_{i-2}, t_{i-1}) is the transition probability in the Hidden Markov Model. 2.3.1 Determining the baseNPs As mentioned before, the goal of the 2nd pass is to search for the best baseNP sequence given the N best POS sequences.</Paragraph> <Paragraph position="1"> Considering E, T and B as random variables, according to Bayes' rule we have P(B|T,E) = P(B) P(T,E|B) / P(T,E). Because we search for the best baseNP sequence for each possible POS sequence of the given sentence E, P(E|T) P(T) = P(E,T) = const. Furthermore, from the definition of B, during each search procedure we have P(B|T,E) proportional to the product over i of P(w_i | n_i) P(n_i | n_{i-2}, n_{i-1}).</Paragraph> <Paragraph position="3"> To summarize: in the first step, the Viterbi N-best search algorithm is applied in the POS tagging procedure; it determines a path probability for each POS sequence. In the second step, for each possible POS tagging result, the Viterbi algorithm is applied again to search for the best baseNP sequence. 
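The two integrated steps just summarized can be sketched schematically as follows. This is my own pseudocode-level sketch, not the authors' implementation: `pos_nbest` and `viterbi_baseNP` are assumed helper functions standing in for the two Viterbi passes, and the normalization coefficient the paper applies to the POS path probability is omitted here for simplicity.

```python
import math

# Schematic two-pass search: pick the (B, T) pair that maximizes the
# combined score log P(B|T,E) + log P(T|E) over the N best POS taggings.
def identify_baseNP(sentence, pos_nbest, viterbi_baseNP, n=4):
    """pos_nbest(sentence, n) yields (T, log P(T|E)) pairs;
    viterbi_baseNP(sentence, T) returns (B, log P(B|T,E))."""
    best_score, best_B, best_T = -math.inf, None, None
    for T, logp_T in pos_nbest(sentence, n):      # pass 1: N-best POS tagging
        B, logp_B = viterbi_baseNP(sentence, T)   # pass 2: best B for this T
        score = logp_B + logp_T
        if score > best_score:
            best_score, best_B, best_T = score, B, T
    return best_B, best_T  # best baseNP sequence and its POS sequence
```

Note that the best baseNP sequence need not come from the single best POS sequence, which is why the N best taggings are carried into the second pass.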
Every baseNP sequence found in this pass is also associated with a path probability.</Paragraph> <Paragraph position="5"> The overall score of a candidate is the product of its baseNP path probability and its POS path probability raised to the power of a normalization coefficient (a = 2.4 in our experiments). When we determine the best baseNP sequence for the given sentence E, we also determine the best POS sequence of E, which corresponds to the best baseNP of E.</Paragraph> <Paragraph position="6"> Now let us illustrate the whole process through an example: &quot;stock was down 9.1 points yesterday morning.&quot; In the first pass, one of the N-best POS tagging results of the sentence is: T =</Paragraph> <Paragraph position="8"> The second pass will try to determine the baseNPs, as shown in Figure 2. The details of the path drawn with a dashed line are given in Figure 3; its probability, calculated in the second pass, is as follows (Ph is a pseudo variable):</Paragraph> </Section> <Section position="3" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 2.4 The statistical parameter training </SectionTitle> <Paragraph position="0"> In this work, the training and testing data were derived from the 25 sections of the Penn Treebank.</Paragraph> <Paragraph position="1"> We divided the whole Penn Treebank data into two parts, one for training and the other for testing.</Paragraph> <Paragraph position="2"> As required by our statistical model, we have to calculate the following four probabilities: (1) P(t_i | t_{i-2}, t_{i-1}), (2) P(w_i | t_i), (3) P(n_i | n_{i-2}, n_{i-1}) and (4) P(w_i | n_i). The first and the third parameters are trigram probabilities of T and B respectively; the second and the fourth are lexical generation probabilities. Probabilities (1) and (2) can be estimated directly from the POS-tagged data.</Paragraph> <Paragraph position="4"> As each sentence in the training set has both POS tags and baseNP boundary tags, it can be converted into the two sequences B (a) and Q (b) described in the last section. 
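As a minimal illustration of how a lexical generation parameter such as P(w_i | t_i) can be read off annotated sequences, here is a straightforward maximum-likelihood counting sketch (my own addition, not the paper's code; the trigram parameters are estimated analogously from tag-trigram counts):

```python
from collections import Counter

# Estimate lexical generation probabilities P(word | tag) by relative
# frequency over an annotated corpus of (word, tag) sequences.
def lexical_probs(tagged_sentences):
    pair_count, tag_count = Counter(), Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            pair_count[(word, tag)] += 1
            tag_count[tag] += 1
    return {(w, t): c / tag_count[t] for (w, t), c in pair_count.items()}

probs = lexical_probs([[("stock", "NN"), ("was", "VBD")],
                       [("points", "NN")]])
# P(stock|NN) = 0.5 here: "stock" is one of two NN tokens in this toy corpus
```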
Using these sequences, parameters (3) and (4) can be calculated. The calculation formulas are similar to Equations (13) and (14), respectively.</Paragraph> <Paragraph position="5"> Before training the trigram model (3), all possible baseNP rules should be extracted from the training corpus. For instance, the following three sequences are among the baseNP rules extracted.</Paragraph> <Paragraph position="6"> There are more than 6,000 baseNP rules in the Penn Treebank. When training the trigram model (3), we treat those baseNP rules in two ways. (1) Each baseNP rule is assigned a unique identifier (UID). This means that the algorithm considers the corresponding structure of each baseNP rule.</Paragraph> <Paragraph position="7"> (2) All of the rules are assigned the same identifier (SID). In this case, the rules are grouped into a single class. Nevertheless, the identifiers of baseNP rules are still different from the identifiers assigned to POS tags.</Paragraph> <Paragraph position="8"> We used the approach of Katz (Katz, 1987) for parameter smoothing, and built trigram models to predict the probabilities of parameters (1) and (3). In the case that unknown words are encountered during baseNP identification, we calculate parameters (2) and (4) in the following way.</Paragraph> </Section> </Section> <Section position="4" start_page="121" end_page="121" type="metho"> <SectionTitle> 3 Experiment result </SectionTitle> <Paragraph position="0"> We designed five experiments, as shown in Table 1. &quot;UID&quot; and &quot;SID&quot; mean respectively that a unique identifier is assigned to each baseNP rule or that the same identifier is assigned to all baseNP rules. &quot;+1&quot; and &quot;+4&quot; denote the number of best POS sequences retained in the first step. &quot;UID+R&quot; means that the POS tagging result of the given sentence is taken to be totally correct for the 2nd step.</Paragraph> <Paragraph position="1"> This provides an ideal upper bound for the system. 
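The precision, recall, and F-measure figures reported for these experiments combine in the standard way for chunking evaluation; as a reminder, a trivial helper (my own addition, not from the paper):

```python
# Standard chunking evaluation over proposed vs. gold baseNP brackets:
# precision P, recall R, and F-measure F = 2PR / (P + R).
def prf(num_correct, num_proposed, num_gold):
    p = num_correct / num_proposed
    r = num_correct / num_gold
    return p, r, 2 * p * r / (p + r)
```

When precision and recall are close, as in the scores reported here, the F-measure lies close to both.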
The reason why we choose N=4 for the N-best POS tagging is explained by Figure 4, which shows how the precision of POS tagging changes with N.</Paragraph> <Paragraph position="2"> In the experiments, the training and testing sets are derived from the 25 sections of the Wall Street Journal distributed with Penn Treebank II, and the definition of baseNP is the same as Ramshaw's. Table 1 summarizes the average performance on both baseNP tagging and POS tagging: each section of the whole Penn Treebank was used in turn as the testing data, with the other 24 sections as the training data; in this way we performed cross-validation experiments. We also evaluated the statistical model on various sizes of training data. The x-coordinate denotes the size of the training set, where &quot;1&quot; indicates that the training set consists of sections 0-8 of the Penn Treebank, &quot;2&quot; corresponds to the corpus that adds sections 9-11 to &quot;1&quot;, and so on. In this way the size of the training data grows larger and larger. In these cases the testing data is always section 20 (which is excluded from the training data).</Paragraph> <Paragraph position="3"> From Figure 7, we learned that POS tagging and baseNP identification influence each other. We conducted two experiments to study whether the POS tagging process can make use of baseNP information. One is UID+4, in which the precision of POS tagging dropped slightly with respect to standard POS tagging with trigram Viterbi search. In the second experiment, SID+4, the precision of POS tagging increased slightly. 
This result shows that POS tagging can benefit from baseNP information.</Paragraph> <Paragraph position="4"> Whether or not the baseNP information can improve the precision of POS tagging in our approach is determined by the identifier assignment of the baseNP rules when training the trigram model of P(n_i | n_{i-2}, n_{i-1}). In the future, we will further study optimal clustering of baseNP rules to further improve the performance of both baseNP identification and POS tagging.</Paragraph> </Section> <Section position="6" start_page="121" end_page="121" type="metho"> <SectionTitle> 4 Comparison with other approaches </SectionTitle> <Paragraph position="0"> To our knowledge, three other approaches to baseNP identification have been evaluated using the Penn Treebank: Ramshaw &amp; Marcus's transformation-based chunker, Argamon et al.'s MBSL, and Cardie's Treebank_lex. In Table 2, we give a comparison of our method with these three. In this experiment, we use the testing data prepared by Ramshaw (available at http://www.cs.biu.ac.il/~yuvalk/MBSL); the training data is selected from the other 24 sections of the Penn Treebank (excluding section 20). We can see that our method achieves better results than the three other methods. Our statistical model unifies baseNP identification and POS tagging by tracing the N best sequences of POS tagging in the baseNP recognition pass, while the other methods use POS tagging as a pre-processing procedure. From Table 1, if we retain the 4 best outputs of POS tagging, rather than only one, the F-measure of baseNP identification is improved from 93.02% to 93.07%. 
After considering baseNP information, the error ratio of POS tagging is reduced by 2.4% (comparing SID+4 with SID+1).</Paragraph> <Paragraph position="1"> The transformation-based method (R&amp;M 95) identifies baseNPs within a local window of the sentence by matching transformation rules.</Paragraph> <Paragraph position="2"> Similarly to MBSL, the 2nd pass of our algorithm traces all possible baseNP brackets, and makes a global decision through Viterbi search. On the other hand, unlike MBSL, we take lexical information into account. The experiments show that lexical information is very helpful for improving both precision and recall of baseNP recognition. If we neglect the lexical generation probability in our model, the precision/recall ratios are reduced from 92.3%/93.2% to 90.0%/92.4%. Cardie's approach of Treebank rule pruning may be regarded as a special case of our statistical model, since the maximum-matching algorithm over baseNP rules is a simplified version of our statistical model. Compared with this rule pruning method, all baseNP rules are kept in our model; therefore, in principle, we are less likely to fail to recognize baseNP types. As to the complexity of the algorithm, our approach is determined by the Viterbi algorithm and is O(n), linear in the length of the sentence.</Paragraph> </Section> </Paper>