File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0106_metho.xml
Size: 17,139 bytes
Last Modified: 2025-10-06 14:14:38
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0106"> <Title>Grammar Acquisition Based on Clustering Analysis and Its Application to Statistical Parsing</Title>
<Section position="5" start_page="31" end_page="37" type="metho"> <SectionTitle> 3 Local Context Effectiveness </SectionTitle>
<Paragraph position="0"> As the similarity of any two labels is estimated from local contextual information, which is defined by a set of category pairs of left and right words, there is an interesting question of which contexts are useful for the calculation of similarity. The effectiveness of contexts has been noted in previous research [Bar95]. One suitable measure of the effectiveness of a context is its dispersion over labels: the occurrence counts of a useful context should differ widely across labels. From this, the effectiveness E(c) of a context c can be defined using variance as follows:</Paragraph>
<Paragraph position="1"> $E(c) = \frac{1}{|A|} \sum_{a \in A} \bigl(N(a,c) - \bar{N}(c)\bigr)^{2}, \qquad \bar{N}(c) = \frac{1}{|A|} \sum_{a \in A} N(a,c)$ </Paragraph>
<Paragraph position="2"> where A is the set of all labels and a is one of its members, N(a, c) is the number of times a label a and a context c co-occur, and $\bar{N}(c)$ is the average of N(a, c) over the labels a. To take the greatest advantage of contexts in clustering, it is preferable to choose contexts c with a high value of E(c), because such contexts tend to discriminate well when characterizing labels. Ranking the contexts by the effectiveness value E, some of the higher-ranked contexts are selected for clustering the labels instead of all contexts. This enables us to decrease computation time and space without sacrificing the accuracy of the clustering results, and it sometimes also helps us to remove noise caused by useless contexts. Some experiments were done to support this assumption and their results are shown in the next section.</Paragraph>
<Paragraph position="4"> This section describes a statistical parsing model which takes a sentence as input and produces a phrase-structure tree as output. Two components are taken into account in this problem: a statistical model and a parsing process. The model assigns a probability to every candidate parse tree for a sentence. Formally, given a sentence S and a tree T, the model estimates the conditional probability P(T|S). The most likely parse under the model is $\arg\max_{T} P(T|S)$, and the parsing process is a method to find this parse. While a simple probabilistic CFG defines the probability of a parse as the product of the probabilities of all applied rules, our model takes the left and right contexts of a constituent into account and estimates P(T|S) by assuming that each rule depends not only on the occurrence of the rule itself but also on its left and right context, as follows.</Paragraph>
<Paragraph position="5"> $P(T \mid S) \approx \prod_{i=1}^{n} P(r_i \mid c_i)$ </Paragraph>
<Paragraph position="6"> where $r_i$ is a rule applied in the tree and $c_i$ is the pair of left and right contexts at the place where the rule is applied. As in most probabilistic models, and as in our clustering process, there is a problem of low-frequency events in this model. Although some statistical NLP applications apply backing-off estimation techniques to handle low-frequency events, our model uses a simple interpolation estimation that adds a uniform probability to every event. Moreover, we use the geometric mean of the probability instead of the original probability in order to eliminate the effect of the number of rule applications, as done in [Mag91].</Paragraph>
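To make the interpolation and geometric-mean normalization concrete, the following is a minimal sketch of how a candidate parse could be scored from the probabilities P(r_i | c_i) of its rule applications; the function and parameter names and the explicit uniform term are illustrative assumptions, not the paper's implementation.

```python
import math

def parse_score(rule_context_probs, alpha=0.95, uniform_prob=1e-4):
    """Score a candidate parse from the estimated probabilities P(r_i | c_i)
    of its rule applications.  Each probability is interpolated with a uniform
    term so that unseen (zero-count) events keep a small probability, and the
    geometric mean removes the effect of the number of rule applications."""
    if not rule_context_probs:
        return 0.0
    log_total = 0.0
    for p in rule_context_probs:
        smoothed = alpha * p + (1.0 - alpha) * uniform_prob
        log_total += math.log(smoothed)
    return math.exp(log_total / len(rule_context_probs))

# usage: three rule applications, one of them never seen in training
print(parse_score([0.40, 0.05, 0.0]))
```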
<Paragraph position="7"> The modified model is: $P(T \mid S) = \Bigl( \prod_{i=1}^{n} \bigl( \alpha\, P(r_i \mid c_i) + (1-\alpha)\, P_{u} \bigr) \Bigr)^{1/n}$, where $P_{u}$ is the uniform distribution over events and n is the number of rule applications in the tree.</Paragraph>
<Paragraph position="8"> Here, $\alpha$ is a balancing weight between the observed distribution and the uniform distribution; it is set to 0.95 in our experiments. The parsing algorithm applied is a simple bottom-up chart parser whose scoring function is based on this model. The grammar used is the one trained by the algorithm described in section 2. A dynamic programming algorithm is used: if there are two proposed constituents which span the same set of words and have the same label, the lower-probability constituent can be safely discarded.</Paragraph>
<Paragraph position="9"> 5 Experimental Evaluation
To give some support to our suggested grammar acquisition method and statistical parsing model, the three following evaluation experiments were made. The experiments use texts from the Wall Street Journal (WSJ) corpus and its bracketed version provided by the Penn Treebank. Out of nearly 48,000 sentences (1,222,065 words), we extracted 46,000 sentences (1,172,710 words) as material for training a grammar and 2,000 sentences (49,355 words) for testing.</Paragraph>
<Paragraph position="10"> The first experiment is an evaluation of the performance of our proposed grammar learning method described in section 2. In this preliminary experiment, only rules which have lexical categories as their right-hand side are considered, and the acquired nonterminal labels are compared with those assigned in the WSJ corpus. The second experiment investigates the effectiveness of contexts described in section 3. The purpose is to find useful contexts and use them instead of all contexts, based on the assumption that not all contexts are useful for clustering brackets in grammar acquisition; reducing the number of contexts helps to reduce computation time and space. The last experiment evaluates the whole grammar learned from local contextual information and indicates the performance of our statistical parsing model using the acquired grammar. The measures used for this evaluation are bracketing recall, precision and crossings.</Paragraph>
<Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 5.1 Evaluation of Clustering in Grammar Acquisition </SectionTitle>
<Paragraph position="0"> This subsection shows some results of our preliminary experiments to confirm the effectiveness of the proposed grammar acquisition technique. The grammar is learned from the WSJ bracketed corpus with all nonterminals omitted. In this experiment, we focus only on the rules with lexical categories as their right-hand side, for instance c1 → (JJ)(NN), c2 → (DT)(NN) and c3 → (PRP$)(NN) in figure 1. For reasons of computation time and space, we use the rule tokens which appear more than 500 times in the corpus; the number of initial rules is 51.</Paragraph>
<Paragraph position="1"> From these rules, the most similar pair is found and merged into a new label. The merging process is carried out iteratively, and at each step the differential entropy is calculated. During the merging process there are some sharp peaks indicating rapid fluctuation of the entropy, and these peaks can be used to terminate the merging process. In the experiments, a peak with DE > 0.12 is used as the stopping point.</Paragraph>
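As a rough illustration of the iterative merging with an entropy-based stopping rule, the sketch below assumes that a pairwise similarity function and an entropy function are already defined; they stand in for the measures of section 2, which are not reproduced here, and reading the differential entropy as the change in entropy caused by a single merge is our assumption.

```python
def cluster_labels(labels, similarity, entropy, peak_threshold=0.12):
    """Greedily merge the most similar pair of groups until the differential
    entropy (the entropy change caused by a merge) shows a sharp peak."""
    groups = [{label} for label in labels]
    prev_entropy = entropy(groups)
    while len(groups) > 1:
        # find the most similar pair of current groups
        best_i, best_j, best_sim = 0, 1, float("-inf")
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sim = similarity(groups[i], groups[j])
                if sim > best_sim:
                    best_i, best_j, best_sim = i, j, sim
        merged = [g for k, g in enumerate(groups) if k not in (best_i, best_j)]
        merged.append(groups[best_i] | groups[best_j])
        new_entropy = entropy(merged)
        # a sharp rise in entropy signals that dissimilar groups are being
        # forced together, so the merging process is halted at this step
        if new_entropy - prev_entropy > peak_threshold:
            break
        groups, prev_entropy = merged, new_entropy
    return groups
```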
<Paragraph position="2"> As a result, the process halts at the 45th merging step and 6 groups are obtained. This result is evaluated by comparing the system's output with the nonterminal symbols given in the WSJ corpus. The evaluation method uses a contingency table model, introduced in [Swe69] and widely used in information retrieval and psychology [Aga95][Iwa95]. The following measures are considered:</Paragraph>
<Paragraph position="3"> $PR = \frac{a}{a+c}, \quad PP = \frac{a}{a+b}, \quad NR = \frac{d}{b+d}, \quad NP = \frac{d}{c+d}, \quad FM = \frac{(\beta^{2}+1) \cdot PP \cdot PR}{\beta^{2} \cdot PP + PR}$
where a is the number of label pairs which the WSJ corpus assigns to the same group and the system does as well, b is the number of pairs which the WSJ corpus does not assign to the same group but the system does, c is the number of pairs which the WSJ assigns to the same group but the system does not, and d is the number of pairs which neither the WSJ nor the system assigns to the same group. The F-measure (FM) is used as a combined measure of recall and precision, where β is the weight of recall relative to precision; here we use β = 1.0, i.e., equal weight.</Paragraph>
<Paragraph position="4"> The result shows 0.93 PR, 0.93 PP, 0.92 NR, 0.92 NP and 0.93 FM, which are all relatively good values. In particular, PP shows that almost all labels that are identical in the WSJ are assigned to the same groups. In order to investigate whether applying differential entropy to cut off the merging process is appropriate, we plot the values of these measures at all merging steps, as shown in figure 2. From the graphs, we find that the best solution is located around the 44th-45th merging steps, which is consistent with the grouping result of our approach. Moreover, the precision equals 100% over the 1st-38th steps, indicating that the merging process is suitable.</Paragraph>
<Paragraph position="6"> [Figure 2: recall, precision and F-measure plotted at each merging step.] </Paragraph>
<Paragraph position="9"> As another experiment, we examine the effectiveness of contexts in the clustering process in order to reduce computation time and space. Variance is used to express the effectiveness of a context, under the assumption that the context with the highest variance is the most effective.</Paragraph>
<Paragraph position="10"> The experiment is done by selecting the top N contexts and using them instead of all contexts in the clustering process.</Paragraph>
<Paragraph position="11"> Besides the cases of N = 10, 50, 200, 400 and all (2,401) contexts, a case in which 200 contexts are randomly chosen from all contexts is included in order to examine the assumption that variance is effective; for this case, 3 trials are made and the average value is used. Owing to limited space, we show only the F-measure, in figure 3. The graphs tell us that the case of the top 200 contexts is superior to the case of 200 random contexts at all merging steps.</Paragraph>
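A minimal sketch of this variance-based ranking, i.e. the effectiveness measure E(c) from section 3, is given below; the co-occurrence table and parameter names are hypothetical and not taken from the paper.

```python
def rank_contexts(cooccurrence, top_n=200):
    """Rank contexts by E(c), the variance of the co-occurrence counts N(a, c)
    across labels a, and return the top_n contexts for use in clustering.
    `cooccurrence` maps (label, context) pairs to counts."""
    labels = {a for (a, _) in cooccurrence}
    contexts = {c for (_, c) in cooccurrence}
    effectiveness = {}
    for c in contexts:
        counts = [cooccurrence.get((a, c), 0) for a in labels]
        mean = sum(counts) / len(counts)
        effectiveness[c] = sum((n - mean) ** 2 for n in counts) / len(counts)
    ranked = sorted(contexts, key=lambda c: effectiveness[c], reverse=True)
    return ranked[:top_n]

# usage: counts of (label, (left-category, right-category)) pairs;
# the context that discriminates the two labels ranks first
table = {("c1", ("DT", "VBZ")): 40, ("c2", ("DT", "VBZ")): 2,
         ("c1", ("IN", "NN")): 21, ("c2", ("IN", "NN")): 19}
print(rank_contexts(table, top_n=1))
```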
<Paragraph position="12"> This means that variance is a good measure for selecting a set of effective contexts in the clustering process. Furthermore, we observe that high accuracy can be achieved even when not all contexts are taken into account: the best F-measures are all 0.93 and the numbers of groups are 2, 5, 5 and 6 for the cases of 10, 50, 200 and 400 contexts, respectively. Except for the case of 10 contexts, all cases show a good result compared with using all contexts (0.93, 6 groups). This tells us that it is reasonable to prefer contexts with large variance over ones with small variance, and that a relatively large number of contexts is sufficient for the clustering process. Through preliminary experiments, we found that the following criterion is sufficient for determining the number of contexts: contexts are selected in decreasing order of variance, and a context is accepted as long as its variance is more than 10% of the average variance of the previously selected contexts.</Paragraph> </Section>
<Section position="2" start_page="35" end_page="37" type="sub_section"> <SectionTitle> 5.3 Performance of Statistical Parsing Model </SectionTitle>
<Paragraph position="0"> Using the top N contexts, we learn the whole grammar based on the algorithm given in section 2. Brackets (rules) which occur more than 40 times in the corpus are considered, and the number of contexts used is determined by the criterion described in the previous subsection. As the result of the grammar acquisition process, 1,396 rules are acquired. These rules are attached with the conditional probability based on contexts (the left and right categories of the rules). The chart parser tries to find the best parse of the sentence. 46,000 sentences are used for training the grammar and 2,000 sentences form the test set. To evaluate the performance, the PARSEVAL measures as defined in [Bla91] are used:</Paragraph>
<Paragraph position="6"> Precision = (number of correct brackets in proposed parses) / (number of brackets in proposed parses); Recall = (number of correct brackets in proposed parses) / (number of brackets in treebank parses). The parser generates the most likely parse based on the context-sensitive conditional probabilities of the grammar. Among the 2,000 test sentences, only 1,874 sentences can be parsed, owing to two reasons: (1) our algorithm considers only rules which occur more than 40 times in the corpus, and (2) some test sentences have characteristics different from the training sentences. Table 1 displays the detailed results of our statistical parser evaluated against the WSJ corpus.</Paragraph>
<Paragraph position="7"> 93% of the sentences can be parsed, with 71% recall, 52% precision and 4.5 crossings per sentence. For short sentences (3-9 words), the parser achieves up to 88% recall and 71% precision with only 0.71 crossings. For moderately long sentences (10-19 and 20-30 words), it works with 60-71% recall and 41-51% precision. From these results, the proposed parsing model is shown to achieve reasonably high bracketing recall.</Paragraph>
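For concreteness, the bracketing precision and recall used above can be computed from span sets roughly as follows; the span-only representation, which ignores labels and the crossing count, is a simplification assumed for this sketch.

```python
def bracketing_scores(proposed, treebank):
    """PARSEVAL-style bracketing precision and recall, with brackets
    represented as (start, end) word spans."""
    proposed_spans = set(proposed)
    treebank_spans = set(treebank)
    correct = proposed_spans & treebank_spans
    precision = len(correct) / len(proposed_spans) if proposed_spans else 0.0
    recall = len(correct) / len(treebank_spans) if treebank_spans else 0.0
    return precision, recall

# usage: the proposed parse adds one extra, finer-grained bracket
print(bracketing_scores([(0, 5), (0, 2), (2, 5), (3, 5)], [(0, 5), (0, 2), (2, 5)]))
# -> (0.75, 1.0): lower precision but full recall, as in the discussion below
```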
<Paragraph position="8"> Although our parser does not achieve good precision, this is not a serious problem, because the parser gives a more detailed bracketing of a sentence than the one given in the WSJ corpus. In the next section, a comparison with other research is discussed.</Paragraph> </Section> </Section>
<Section position="6" start_page="37" end_page="37" type="metho"> <SectionTitle> 6 Related Works and Discussion </SectionTitle>
<Paragraph position="0"> In this section, our approach is compared with some previous methods, which can be classified into non-grammar-based and grammar-based approaches. Among non-grammar-based approaches, the most successful probabilistic parser, named SPATTER, was proposed by Magerman [Mag95]. The parser is constructed using decision-tree learning techniques and achieves up to 86-90% bracketing accuracy (both recall and precision) when trained on the WSJ corpus, a fully parsed corpus with nonterminal labels. Later, Collins [Col96] introduced a statistical parser based on probabilities of bigram dependencies between head words in a parse tree, which achieved at least the same accuracy as SPATTER. These two methods use a corpus which includes both lexical categories and nonterminal categories. However, assigning nonterminal labels to a corpus is a hard task, and the way a nonterminal label is assigned to each constituent of a parsed sentence is arduous and somewhat arbitrary. It follows that it is worth trying to infer a grammar from corpora without nonterminal labels.</Paragraph>
<Paragraph position="1"> One of the most promising results among grammar-based approaches to grammar inference is the inside-outside algorithm proposed by Lari [Lar90] to construct a grammar from an unbracketed corpus. This algorithm is an extension of the forward-backward algorithm and infers the parameters of a stochastic context-free grammar. In that research, the acquired grammar is evaluated based on its entropy or perplexity, and parsing accuracy is not taken into account. In other work, Pereira and Schabes [Per92][Sch93] proposed a modified method to infer a stochastic grammar from a partially parsed corpus and evaluated the results against a bracketed corpus. This approach gained up to around 90% bracketing recall for short sentences (0-15 words) but suffered from a large amount of ambiguity for long ones (20-30 words), where 70% recall is obtained. The acquired grammar is normally in Chomsky normal form, which is a special case of CFG, although it is claimed that every CFG can be put in this form. This type of grammar makes all output parses of the method binary-branching trees, so bracketing precision cannot be taken into account, because the correct parses in the corpus need not be in this form. On the other hand, our proposed approach can learn a standard CFG with 88% recall for short sentences and 60% recall for long ones. This result shows that our method achieves the same level of accuracy as the inside-outside algorithm. However, our approach can learn a grammar which is not restricted to Chomsky normal form, and it does so with less computational cost than approaches applying the inside-outside algorithm.</Paragraph> </Section> </Paper>