<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1511"> <Title>Exploiting Contextual Information in Hypothesis Selection for Grammar Refinement</Title> <Section position="4" start_page="78" end_page="79" type="metho"> <SectionTitle> 2 The Framework of Grammar Development </SectionTitle> <Paragraph position="0"> The proposed framework is composed of two phases: partial grammar acquisition and grammar refinement. A graphical representation of the framework is shown in Figure 1. In the process of grammar development, a partial grammar is automatically acquired in the first phase and then refined in the second phase. In the latter phase, the system generates new rules, ranks them in order of priority, and displays to the user a list of plausible rules as candidates for refining the grammar. The user can then select the best one among these rules.</Paragraph> <Paragraph position="1"> Currently, the corpus used for grammar development in the framework is the EDR corpus (EDR, 1994), in which lexical tags are assigned to words and the phrase structures of sentences are marked with brackets, but no nonterminal labels are given.</Paragraph> <Section position="2" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 2.1 Partial Grammar Acquisition </SectionTitle> <Paragraph position="0"> In this section, we give a brief explanation of partial grammar acquisition; more detail can be found in (Theeramunkong and Okumura, 1996). In partial grammar acquisition, a rough grammar is constructed from the corpus based on clustering analysis. As mentioned above, the corpus used is a tagged corpus with phrase structures marked with brackets. First, brackets covering the same sequence of categories are assumed to have the same nonterminal label; we say they have the same bracket type. The basic idea is to group the bracket types in the corpus into a number of groups of similar bracket types. The corpus is then automatically labeled with nonterminal labels, and consequently a grammar is acquired. The similarity between any two bracket types is calculated based on divergence (Harris, 1951), utilizing local contextual information, which is defined as the pair of categories of the words immediately before and after a bracket type. This approach was evaluated through several experiments, and the result obtained was almost consistent with that given by human evaluators. However, in this approach, when the number of occurrences of a bracket type is low, the similarity between this bracket type and other bracket types is not reliable. For this reason, only bracket types with relatively frequent occurrence are taken into account. To deal with rarely occurring bracket types, we develop the second phase, in which the system shows candidates to grammar developers, who can then determine the best one among these candidates, as described in the next section.</Paragraph> </Section>
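For concreteness, the following is a minimal sketch of how the divergence-based similarity between bracket types could be computed from local contexts. The section above does not give the exact divergence formula, so the sketch assumes a symmetrised Kullback-Leibler divergence over smoothed distributions of (left category, right category) pairs; all function names and the toy data are illustrative, not taken from the paper.

    # Hypothetical sketch: similarity between two bracket types from their
    # local-context distributions, using a symmetrised KL divergence.
    import math
    from collections import Counter

    def context_distribution(contexts, all_contexts, smoothing=1e-3):
        # contexts: list of (left_category, right_category) pairs observed
        # around occurrences of one bracket type.
        counts = Counter(contexts)
        total = len(contexts) + smoothing * len(all_contexts)
        return {c: (counts[c] + smoothing) / total for c in all_contexts}

    def divergence(p, q):
        # Symmetrised Kullback-Leibler divergence between two context
        # distributions; smaller values mean more similar bracket types.
        return (sum(p[c] * math.log(p[c] / q[c]) for c in p)
                + sum(q[c] * math.log(q[c] / p[c]) for c in q))

    # Toy usage: bracket types that should share a nonterminal label are
    # expected to show a small divergence between their context distributions.
    ctx_a = [("det", "verb"), ("det", "aux"), ("prep", "verb")]
    ctx_b = [("det", "verb"), ("det", "verb"), ("prep", "punct")]
    vocab = set(ctx_a) | set(ctx_b)
    print(divergence(context_distribution(ctx_a, vocab),
                     context_distribution(ctx_b, vocab)))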
<Section position="3" start_page="78" end_page="79" type="sub_section"> <SectionTitle> 2.2 Grammar Refinement with Additional Hypothesis Rule </SectionTitle> <Paragraph position="0"> The grammar acquired in the previous phase is a partial one: it is insufficient for analyzing all sentences in the corpus, and the parser therefore fails to produce a complete parse for some sentences. In order to deal with these unparsable sentences, we modify a conventional chart parser to keep a record of all inactive edges as partial parsing results. Two processes are invoked to find plausible interpretations of an unparsable sentence by hypothesizing some new rules and later adding them to the current grammar. These processes are (1) the rule-based process, which detects incompleteness of the current grammar and generates a set of hypotheses of new rules, and (2) the corpus-based process, which selects plausible hypotheses based on local contextual information. In the rule-based process, the parser generates as many partial parses of a sentence as possible in a bottom-up manner under the grammar constraints. Utilizing these parses, the process searches for a complete parse of the sentence by starting at the top category (i.e., sentence) covering the whole sentence and then searching down, in a top-down manner, to the part of the sentence that cannot form any parse.</Paragraph> <Paragraph position="1"> At this point, a rule is hypothesized. In many cases, there may be several possibilities for the hypothesized rules. The corpus-based process, as the second process, uses probability information from parsable sentences to rank these hypotheses. In this research, local contextual information is taken into account for this task.</Paragraph> </Section> </Section> <Section position="5" start_page="79" end_page="79" type="metho"> <SectionTitle> 3 Hypothesis Generation </SectionTitle> <Paragraph position="0"> When the parser fails to parse a sentence, there is no inactive edge of category S (sentence) spanning the whole sentence in the parsing result. The hypothesis generation process is then invoked to find all possible hypotheses in a top-down manner, starting from a single hypothesis of the category S covering the whole sentence. This process uses the partial chart constructed while parsing the sentence. This hypothesis generation is similar to the one applied in (Kiyono and Tsujii, 1994a).</Paragraph> <Paragraph position="1"> [Hypothesis generation] An inactive edge [ie(A) : x0, xn] can be introduced from x0 to xn with label A, for each of the hypotheses generated by the following two steps.</Paragraph> <Paragraph position="2"> 1. For each sequence of inactive edges, [ie(B1) : x0, x1], ..., [ie(Bn) : xn-1, xn], spanning from x0 to xn, generate a new rule, A → B1, ..., Bn, and propose a new inactive edge as a hypothesis, [hypo(A) : x0, xn]. (Figure 2(1)) 2. For each existing rule A → A1, ..., An, find an incomplete sequence of inactive edges, [ie(A1) : x0, x1], ..., [ie(Ai-1) : xi-2, xi-1], [ie(Ai+1) : xi, xi+1], ..., [ie(An) : xn-1, xn], and call this algorithm for [ie(Ai) : xi-1, xi]. (Figure 2(2))</Paragraph> <Paragraph position="4"> By this process, all possible single hypotheses (rules) which enable the parsing process to succeed are generated. In general, most of these rules may be linguistically unnatural. To filter out such unnatural hypotheses, some syntactic criteria are introduced: for example, (1) the maximum number of daughter constituents of a rule is limited to three, (2) a rule with one daughter is not preferred, and (3) non-lexical categories are distinguished from lexical categories, and a rule with a lexical category as its mother is not generated. By these simple syntactic constraints, a lot of useless rules can be discarded.</Paragraph> </Section>
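The sketch below illustrates the two generation steps above on a chart of inactive edges. It is a simplified reading of the algorithm, not the authors' implementation: edges are assumed to be plain (category, start, end) triples, rules are (mother, daughters) pairs, and the three-daughter limit from the filtering criteria is applied directly in the search.

    # Sketch of the two-step hypothesis generation over a partial chart.
    def sequences(chart, start, end, max_len=3):
        """All sequences of inactive edges spanning exactly [start, end]."""
        if start == end:
            yield []
            return
        if max_len == 0:
            return
        for cat, s, e in chart:
            if s == start and e <= end:
                for rest in sequences(chart, e, end, max_len - 1):
                    yield [(cat, s, e)] + rest

    def chains_forward(chart, cats, start):
        """Positions reachable by matching cats left-to-right from 'start'."""
        ends = {start}
        for cat in cats:
            ends = {e for c, s, e in chart for p in ends if c == cat and s == p}
        return ends

    def chains_backward(chart, cats, end):
        """Positions reachable by matching cats right-to-left ending at 'end'."""
        starts = {end}
        for cat in reversed(cats):
            starts = {s for c, s, e in chart for p in starts if c == cat and e == p}
        return starts

    def hypothesize(chart, grammar, label, start, end, depth=2):
        """Single-rule hypotheses (mother, daughters) that would let an
        inactive edge [label : start, end] be built."""
        hyps = set()
        # Step 1: label -> B1 ... Bn for each edge sequence covering [start, end].
        for seq in sequences(chart, start, end):
            if seq:
                hyps.add((label, tuple(cat for cat, _, _ in seq)))
        # Step 2: for an existing rule label -> A1 ... An with one daughter Ai
        # missing, recurse on the gap that Ai would have to cover.
        if depth > 0:
            for mother, daughters in grammar:
                if mother != label:
                    continue
                for i, missing in enumerate(daughters):
                    for gs in chains_forward(chart, daughters[:i], start):
                        for ge in chains_backward(chart, daughters[i + 1:], end):
                            if gs < ge:
                                hyps |= hypothesize(chart, grammar, missing,
                                                    gs, ge, depth - 1)
        return hyps

    # Toy usage: S -> NP VP exists, but no VP edge spans words 1-3, so both
    # S -> NP V NP and the recovering rule VP -> V NP are proposed.
    chart = {("NP", 0, 1), ("V", 1, 2), ("NP", 2, 3)}
    grammar = [("S", ("NP", "VP"))]
    print(hypothesize(chart, grammar, "S", 0, 3))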
<Section position="6" start_page="79" end_page="80" type="metho"> <SectionTitle> 4 Hypothesis Selection with Local Contextual Information </SectionTitle> <Paragraph position="0"> Hypothesis selection utilizes information from local context to rank the rule hypotheses generated in the previous phase. In hypothesis generation, although we use some syntactic constraints to reduce the number of new rules that could be registered into the current grammar, several candidates may still remain. At this point, a scoring mechanism is needed for ranking these candidates so that the best one can be selected as the most plausible hypothesis.</Paragraph> <Paragraph position="1"> This section describes a scoring mechanism in which local contextual information is exploited for this purpose. As mentioned in the previous section, the local contextual information referred to here is defined as the pair of categories of the words immediately before and after a bracket. This information can be used as an environment for characterizing a nonterminal category. The basic idea in hypothesis selection is that rules with the same nonterminal category as their mother tend to occur in similar environments. Local contextual information is gathered beforehand from the sentences in the corpus that the current grammar can parse.</Paragraph> <Paragraph position="2"> When the parser encounters a sentence which cannot be analyzed by the current grammar, new rule hypotheses are proposed by the hypothesis generator. The mother categories of these rules are then compared against the local contextual information of categories gathered from the parsable sentences. The most likely category is selected, and the corresponding rule becomes the most plausible candidate. The scoring function (probability p) for a rule hypothesis Cat → α is defined as follows.</Paragraph> <Paragraph position="3"> p(Cat → α | l, r) ≈ p(Cat | l, r) = N(Cat, l, r) / N(l, r)   (1) </Paragraph> <Paragraph position="4"> where N(Cat, l, r) is the number of times that Cat occurs in the environment (l, r), l is the category of the word immediately before Cat, r is the lexical category of the word immediately after Cat, and N(l, r) is the number of times that l and r occur immediately before and after any category. Note that because it is not possible for us to calculate the probability of Cat → α in the environment (l, r), we estimate it by the probability that Cat occurs in the environment (l, r), that is, how readily the category Cat appears in a given environment (l, r).</Paragraph> </Section>
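As an illustration, here is a small sketch of the scoring function in equation (1), assuming the local contexts of parsable sentences are available as (left category, nonterminal, right category) triples; the class name and data layout are hypothetical, not from the paper.

    # Sketch of the local-context score in equation (1): a hypothesis whose
    # mother is Cat, placed between left category l and right category r,
    # is scored by N(Cat, l, r) / N(l, r) collected from parsable sentences.
    from collections import Counter

    class ContextScorer:
        def __init__(self, observations):
            # observations: iterable of (left_cat, nonterminal, right_cat)
            # triples gathered from sentences the current grammar can parse.
            self.cat_ctx = Counter((cat, l, r) for l, cat, r in observations)
            self.ctx = Counter((l, r) for l, _, r in observations)

        def score(self, mother, left, right):
            denom = self.ctx[(left, right)]
            return self.cat_ctx[(mother, left, right)] / denom if denom else 0.0

    # Usage: rank competing rule hypotheses by the score of their mother
    # category in the environment where the hypothesized edge would sit.
    scorer = ContextScorer([("det", "NP", "verb"), ("det", "NP", "aux"),
                            ("prep", "VP", "punct")])
    hyps = [("NP", ("det", "noun")), ("VP", ("det", "noun"))]
    print(sorted(hyps, key=lambda h: scorer.score(h[0], "det", "verb"),
                 reverse=True))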
<Section position="7" start_page="80" end_page="81" type="metho"> <SectionTitle> 5 The Stochastic Model </SectionTitle> <Paragraph position="0"> This section describes a statistical parsing model which finds the most plausible interpretation of a sentence when a hypothesis is introduced to recover the parsing process for that sentence. In this problem, two components are taken into account: a statistical model and a parsing process. The model assigns a probability to every candidate parse tree of a sentence. Formally, given a sentence S and a tree T, the model estimates the conditional probability P(T|S). The most likely parse under the model is argmax_T P(T|S), and the parsing process is the method used to find this parse. In general, a simple probabilistic context-free grammar (CFG) model defines the probability of a parse as the product of the probabilities of all applied rules. However, for the purposes of our model, where the left and right contexts of a constituent are taken into account, the model is defined as follows.</Paragraph> <Paragraph position="1"> P(T|S) = Π_i p(rl_i | l_i, r_i) </Paragraph> <Paragraph position="2"> where rl_i is a rule applied in the tree and l_i and r_i are respectively the left and right contexts at the place where the rule is applied. A parse tree may contain a hypothesis rule for which we cannot calculate this probability because the rule does not exist in the current grammar; in that case we estimate its probability using formula (1) in section 4.</Paragraph> <Paragraph position="3"> As in most probabilistic models, there is a problem of low-frequency events in this model.</Paragraph> <Paragraph position="4"> Although some statistical NL applications apply backing-off estimation techniques to handle low-frequency events, our model uses a simple interpolation estimate, adding a uniform probability to every event. Moreover, we use the geometric mean of the probability instead of the original probability in order to eliminate the effect of the number of rule applications, as done in (Magerman and Marcus, 1991). The modified model is:</Paragraph> <Paragraph position="5"> P(T|S) = ( Π_{i=1..n} [ α · p(rl_i | l_i, r_i) + (1 − α) · 1/(N_rl · N_c) ] )^(1/n) </Paragraph> <Paragraph position="6"> Here, α is a balancing weight between the observed distribution and the uniform distribution; it is set to 0.8 in our experiments. N_rl is the number of rules, N_c is the number of possible contexts, i.e., pairs of left and right categories, and n is the number of rule applications in the tree. The parsing algorithm applied is a simple bottom-up chart parser whose scoring function is based on this model. A dynamic programming algorithm is used to find the Viterbi parse: if two proposed constituents span the same set of words and have the same label, the one with the lower probability can be safely discarded.</Paragraph> </Section>
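A minimal sketch of how a candidate parse could be scored under this model is given below, using α = 0.8 as stated above; the tree representation, the probability table, and the example values for N_rl and N_c are illustrative assumptions rather than the authors' implementation.

    # Sketch of the interpolated, length-normalised tree score: the geometric
    # mean over rule applications of
    #     alpha * p(rule | left, right) + (1 - alpha) / (N_rl * N_c).
    import math

    ALPHA = 0.8  # balancing weight reported in the experiments

    def tree_score(applications, rule_context_prob, n_rules, n_contexts,
                   alpha=ALPHA):
        """applications: (rule, left_context, right_context) triples in the tree.
        rule_context_prob: dict (rule, left, right) -> observed probability;
        a hypothesis rule absent from the grammar would get its estimate
        from equation (1) instead."""
        uniform = 1.0 / (n_rules * n_contexts)
        log_sum = sum(math.log(alpha * rule_context_prob.get(app, 0.0)
                               + (1.0 - alpha) * uniform)
                      for app in applications)
        # The 1/n exponent (geometric mean) removes the bias toward trees
        # built from fewer rule applications.
        return math.exp(log_sum / len(applications))

    # Usage with illustrative numbers (272 rules; 18 x 18 left/right pairs).
    # Between two constituents with the same span and label, the parser keeps
    # the one with the higher score (the dynamic-programming pruning step).
    apps = [(("S", ("NP", "VP")), "<s>", "</s>"),
            (("VP", ("V", "NP")), "noun", "</s>")]
    probs = {apps[0]: 0.30, apps[1]: 0.05}
    print(tree_score(apps, probs, n_rules=272, n_contexts=18 * 18))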
<Section position="8" start_page="81" end_page="81" type="metho"> <SectionTitle> 6 Experimental Evaluation </SectionTitle> <Paragraph position="0"> Some evaluation experiments and their results are described here. For the experiments, we use texts from the EDR corpus, where bracketings are given. The data comprise 48,100 sentences containing around 510,000 words. Figure 3 shows some example sentences in the corpus. The initial grammar is acquired from the same corpus using the divergence-based method described in section 2.1. The number of rules is 272, the maximum length of rules is 4, and the numbers of terminal and nonterminal categories are 18 and 55 respectively. A part of the initial grammar is shown in Figure 4. In the grammar, lln1 is expected to be a noun phrase with an article, lln2 a noun phrase without an article, and lln3 a verb phrase. Among the 48,100 sentences, 5,083 sentences cannot be parsed by the grammar. We use these sentences for evaluating our hypothesis selection approach.</Paragraph> <Section position="1" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 6.1 The Criterion </SectionTitle> <Paragraph position="0"> In the experiments, we use bracket crossing as a criterion for checking the correctness of a generated hypothesis. Each resulting hypothesis is compared with the brackets given in the EDR corpus. The correctness of a hypothesis is defined as follows. * At least one of the derivations inside the hypothesis includes brackets which do not cross those given in the corpus. * When the hypothesis is applied, it can be used to form a tree whose brackets do not cross those given in the corpus.</Paragraph> </Section> <Section position="2" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 6.2 Hypothesis Level Evaluation </SectionTitle> <Paragraph position="0"> Of the 5,083 unparsable sentences, the hypothesis generator can produce some hypotheses for 4,730 sentences (93.1%). After comparing them with the parses in the EDR corpus, the hypothesis sets of 3,127 sentences (61.5%) include correct hypotheses. We then consider the sentences for which some correct hypotheses can be generated (i.e., the 3,127 sentences) and evaluate our scoring function in selecting the most plausible hypothesis. For each sentence, we rank the generated hypotheses by their preference score according to our scoring function. The result is shown in Table 1. From the table, even though only 12.3% of all generated hypotheses are correct, our hypothesis selection chooses the correct hypothesis for 41.6% of the sentences when the most plausible hypothesis is selected for each sentence. Moreover, 29.8% of the correct hypotheses are placed at ranks 2-5, 24.3% at ranks 6-10, and only 6.2% at ranks greater than 50. This indicates that the hypothesis selection is effective in placing the correct hypotheses at the higher ranks. However, when we consider the top 10 hypotheses, the accuracy is (1362+3368+3134)/(3217+11288+12846) = 28.8%. This indicates that a large number of hypotheses are generated per sentence, which suggests evaluating the correct hypothesis per sentence rather than over all hypotheses.</Paragraph> </Section> <Section position="3" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 6.3 Sentence Level Evaluation </SectionTitle> <Paragraph position="0"> In this section, we consider the accuracy of our hypothesis selection for each sentence. Table 2 displays the accuracy of hypothesis selection as the number of selected hypotheses is varied.</Paragraph> <Paragraph position="1"> From the table, the number of sentences whose best hypothesis is correct is 1,340 (41.6%), and the accuracy rises to 2,623 sentences (81.5%) when the top 10 ranked hypotheses are considered. The result shows that our hypothesis selection is effective enough to place the correct hypothesis at the higher ranks.</Paragraph> </Section> <Section position="4" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 6.4 Parsing Evaluation </SectionTitle> <Paragraph position="0"> Another experiment evaluates parsing accuracy. The parsing model considered here is the one described in section 5. The chart parser outputs the best parse of the sentence; this parse is formed by using the grammar rules and a single rule hypothesis. The result is shown in Table 3. In this evaluation, the PARSEVAL measures as defined in (Black et al., 1991) are used: Precision = number of correct brackets in proposed parses / number of brackets in proposed parses; Recall = number of correct brackets in proposed parses / number of brackets in corpus parses. From this result, we found that the parser achieves 57.3% recall and 65.2% precision for short sentences (3-9 words). In this case, the average number of crossings is 1.87 per sentence, and 69.2% of the compared sentences have fewer than 2 crossings. For long sentences, less of an advantage is obtained. However, our parser achieves 51.4% recall and 56.3% precision over all unparsable sentences.</Paragraph> </Section> </Section>
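For reference, a short sketch of the PARSEVAL bracket measures and the crossing-bracket count used in this evaluation follows, with parses represented as sets of (start, end) spans; this is an illustration of the standard measures, not the evaluation code used in the paper.

    # Sketch of PARSEVAL-style bracket scoring: precision, recall, and
    # crossing brackets, with parses given as sets of (start, end) spans.
    def crosses(a, b):
        """True if spans a and b overlap without one containing the other."""
        (s1, e1), (s2, e2) = a, b
        return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)

    def parseval(proposed, gold):
        correct = len(proposed & gold)
        precision = correct / len(proposed) if proposed else 0.0
        recall = correct / len(gold) if gold else 0.0
        crossings = sum(1 for p in proposed if any(crosses(p, g) for g in gold))
        return precision, recall, crossings

    # Usage with toy spans: one proposed bracket crosses the gold bracketing.
    proposed = {(0, 5), (0, 2), (2, 5), (3, 5)}
    gold = {(0, 5), (0, 3), (3, 5)}
    print(parseval(proposed, gold))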
</Paper>