<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0106">
  <Title>Grammar Acquisition Based on Clustering Analysis and Its Application to Statistical Parsing</Title>
  <Section position="4" start_page="31" end_page="31" type="intro">
    <SectionTitle>
Introduction
</SectionTitle>
    <Paragraph position="0"> inside-outside algorithm which w~ originally proposed by Baker\[Bak79\] and was implemented as applications for speech and language in ~Lar90\], \[Per92\] and \[Sch93\]. Although encouraging results were shown in these works, the derived grammars were restricted to Chomsky normal-form CFGs and there were problems of the small size of acceptable trai=~ng corpora and the relatively high computation time required for training the grandams.</Paragraph>
    <Paragraph position="1"> Towards the problems, this paper proposes a new method which can learn a standard CFG with less computational cost by adopting techniques of clustering analysis to construct a context-sensitive probab'distic grammar from a bracketed corpus where nontermlnal labels are not annotated. Another claim of this paper is that statistics from a large bracketed corpus without nonterminal labels combined with clustering techniques can help us construct a probabilistic grammar which produces an accurate natural language statistical parser. In this method, nonterminal labels for brackets in a bracketed corpus can be automatically assigned by making use of local contextual information which is defined as a set of category pairs of left and right words of a constituent in the phrase structure of a sentence. In this research, based on the assumption that not all contexts are useful in every case, effectiveness of contexts is also investigated. By using only effective contexts, it is possible for us to improve training speed and memory space without a sacrifice of accuracy. Finally, a statistical parsing model bawd on the acquired grammar is provided and the performance is shown through some experiments using the WSJ corpus.</Paragraph>
    <Paragraph position="3"> In the past, Theeramunkong\[The96\] proposed a method of grouping brackets in a bracketed corpus (with lexical tags but no nonterminal labels), according to their local contextual information, as a first step towards the automatic acquisition of a context-free grammar. The basic idea is to apply clustering analysis to find out a number of groups of s;m;\]ar brackets in the corpus and then to ~sign each group with a same nonterminal label. Clustering analysis is a generic name of a variety of mathematical methods that can be used to find out which objects in a set are s;mi\]sr.</Paragraph>
    <Paragraph position="4"> Its applications on natural language processing are varied such as in areas of word classification, text categorization and so on \[Per93\]\[Iwa95\]. However, there is still few researches which apply clustering analysis for grammar inference and parsing~Vior95\]. This section gives an explanation of grammar acquisition based on clustering analysis. In the first place, let us consider the following example of the parse strnctures of two sentences in the corpus in figure 1.</Paragraph>
    <Paragraph position="5"> In the parse structures, leaf nodes are given tags while there is no label for intermedLzte nodes.</Paragraph>
    <Paragraph position="6"> Note that each node corresponds to a bracket in the corpus. With this corpus, the grammar learning task corresponds to a process to determ~=e the label for each intermediate node. In other words, this task is concerned with the way to cluster the brackets into some certain groups based on their similarity and give each group a label. For instance, in figure 1, it is reasonable to classify the brackets (c2),(c4) and (c5) into a same group and give them a same label (e.g., NP(noun phrase)). As the result, we obtain three grammar rules: NP ~ (DT)(NN), NP ~ (PR.P$)(NN) and NP ~ (DT)(cl). To do this, the grammar acquisition algorithm operates in five steps as follows.</Paragraph>
    <Paragraph position="8"> 1. Assign a unique label to each node of which lower nodes are assigned labels. At the initial  step, such node is one whose lower nodes are lexical categories. For example, in figure 1, there are three unique labels derived: cl ~ (JJ)(NN), c2 ~ (DT)(NN) and ~ ~ (PRP.~)(NN). This process is performed throughout all parse trees in the corpus.</Paragraph>
    <Paragraph position="9">  2. Calculate the similarity of every pair of the derived labels. 3. Merge the most ~m~lar pair to a single new label(i.e., a label group) and recalculate the slmilarity of this new label with other labels.</Paragraph>
    <Paragraph position="10"> 4. Repeat (3) until a termination condition is detected. Finally, a certain set of label groups is</Paragraph>
    <Paragraph position="12"> A big man slipped on the ice The boy dropped his wallet somewhere</Paragraph>
    <Paragraph position="14"> on the ice and the boy dropped his wallet somewhere 5. Replace labels in each label group with a new label in the corpus. For example, if (DT)(NN) and (PRP$)(NN) are in the same label group, we replace them with a new label (such as NP) in the whole corpus.</Paragraph>
    <Paragraph position="15"> 6. Repeat (1)-(5) until all nodes in the corpus are assigned labels.</Paragraph>
    <Paragraph position="16">  To compute the similarity of labels, the concept of local contextual information is applied. In this work, the local contextual information is defined as categories of the words immediately before and after a label. This information is shown to be powerful for acquiring phrase structures in a sentence in \[Bri92\]. In our prp|iminary experiments, we also found out that the information are potential for characterizing constituents in a sentence.</Paragraph>
    <Paragraph position="18"/>
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
2.1 Distributional Similarity
</SectionTitle>
      <Paragraph position="0"> While there are a number of measures which can be used for representing the sir-ilarity of labels in the step 2, measures which make use of relative entropy (Kullback-Leibler distance) are of practical interest and scientific. One of these measures is divergence which has a symmetrical property. Its application on natural language processing was firstly proposed by Harris\[Hat51\] and was shown successfully for detecting phrase structures in \[Bri92\]\[Per93\]. Basically, divergence, as well as relative entropy, is not exactly s'nnilarity measure instead it indicates distributional dissimilarity.</Paragraph>
      <Paragraph position="1"> That means the large value it gets, the less similarity it means. The detail of divergence is iUustrated below.</Paragraph>
      <Paragraph position="2"> Let PC/I and Pc= be two probability distributions of labels cI and ~ over contexts, CT The relative entropy between PC/~ and P(c)= is:</Paragraph>
      <Paragraph position="4"> Relative entropy D(Pc~ \[\[Pc2) is a measure of the amount of extra information beyond P(c)~ needed to describe Pc2- The divergence between Poe and PC/2 is defined as D(Pc~ \]lPc~)+D(Pc~\]lPcz), and is a measure of how di~icult it is to distinguish between the two distributions. The context is defined as a pair of words immediately before and after a label(bracket). Any two labels are considered to be identical when they are distributionally siml\]~.r, i.e., the divergence is low. From the practical point view, this measure addresses a problem of sparseness in limited data. Particularly, when p(eJcz) is zero, we cannot calculate the divergence of two probability distributions because the denomi-ator becomes zero. To cope with this problem, the original probability can be modified by a popular technique into the following formula.</Paragraph>
      <Paragraph position="6"> where, N(~) and N(c~, e) are the occurrence frequency of ~ and (~, e), respectively. IOrl is the number of possible contexts and A is an interpolation coefficient. As defin~-g contexts by the left and right lexical categories, \[CT\[ is the square of the number of existing lexical categories. In the formula, the first term means the original estimated probability and the second term expresses a uniform distribution, where the probability of all events is estimated to a fixed --~form number.</Paragraph>
      <Paragraph position="7"> is applied as a balancing weight between the observed distribution and the -=iform distribution.</Paragraph>
      <Paragraph position="8"> In our experimental results, A is assigned with ~ value of 0.6 which seems to make a good estimate.</Paragraph>
    </Section>
    <Section position="2" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
2.2 Termination Condition
</SectionTitle>
      <Paragraph position="0"> During iteratively merging the most slm~l~r labels, all labels will finally be gathered to a single group. Due to this, a criterion is needed for determining whether this merging process should be continued or terminated. In this section, we describe ~ criterion named differential entropy which is a measure of entropy (perplexity) fluctuation before and after merging a pah- of labels. Let cl and c2 be the most similar pair of labels. Also let cs be the result label p(e\[cl), p(e\[c2) and p(e\]c3) are probability distributions over contexts e of cl, c2 and ~, respectively, p(cl), p(c2) and p(c3) are estimated probabilities of cl, c2 and ca, respectively. The differential entropy (DE) is defined as follows.</Paragraph>
      <Paragraph position="2"> where ~ep(elc/) log P(elc/) is the total entropy over various contexts of label c~. The larger DE is, the larger the information fluctuation before and after merging becomes. In general, a small fluctuation is preferred to s larger one because when DE is large, the current merging process introduces a large amount of information fluctuation and its reliability becomes low.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML