<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0104">
  <Title>A Statistics-Based Chinese Parser</Title>
  <Section position="5" start_page="5" end_page="5" type="metho">
    <SectionTitle>
3 The parsing algorithm
</SectionTitle>
    <Paragraph position="0"> The aim of the parser is to take a correctly segmented and POS tagged Chinese sentence as input (for example Figure 2(a)) and produce a phrase structure tree as output (Figure 2(b)). A parsing algorithm for this problem must deal with two important issues: (1) how to produce the suitable syntactic trees from a</Paragraph>
    <Paragraph position="2"> tagged word sequence, (2) how to select the best tree from all of the possible parse trees.</Paragraph>
    <Paragraph position="3"> The key of our approach is to simplify the parsing problem into two processing stages. First, the statistical prediction model assigns a suitable constituent boundary tag to every word in the sentence and produces a partially bracketed sentence (Figure 2(c)). Second, the preference matching model constructs the syntactic trees through bracket matching operations and selects a preference-matched tree as output, using a probability score scheme (Figure 2(d)).</Paragraph>
    <Paragraph position="4">  (b) A candidate parse tree (the correct one), represented by its bracketed and labeled form; (c) A constituent boundary prediction representation of (a); (d) A preference-matched tree of (c). Arrows show the bracket matching operations.</Paragraph>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.1 The boundary prediction model
</SectionTitle>
      <Paragraph position="0"> A constituent boundary parse of a sentence can be represented by a sequence of boundary tags. Each tag corresponds to one word in the sentence, and can take the value L, M or R, respectively meaning the beginning, continuation or termination of a constituent in the syntactic tree. A constituent boundary parse B is therefore given by B = (b1, b2, ..., bn), where b_i is the boundary tag of the i-th word and n is the number of</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="5" type="metho">
    <SectionTitle>
2 The POS and syntactic tags used in this sentence are briefly described as follows. Some detailed information about our POS
</SectionTitle>
    <Paragraph position="0"> and syntactic tagsets can be found in \[ZQd96\]: \[POS tags\]: r-pronoun, n-noun, v-verb, m-numeral, q-classifier, w-punctuation. \[Syn tags\]: np-noun phrase, mp-numeral-classifier phrase, vp-verb phrase, dj-simple sentence pattern, zj-complete sentence.</Paragraph>
    <Paragraph position="1"> words in the sentence.</Paragraph>
    <Paragraph position="2"> Let S = &lt;W,T&gt; be the input sentence for syntactic analysis, where W = w1, w2, ..., wn is the word sequence in the sentence, and T = t1, t2, ..., tn is the corresponding POS tag sequence, i.e., t_i is the POS tag of w_i. Just like the statistical approaches in many automatic POS tagging programs, our job is to select the constituent boundary sequence B' with the highest score, P(B|S), from all possible sequences.</Paragraph>
    <Paragraph position="4"> Furthermore, we replace P(W|B) and P(T|B) by the approximation that each constituent boundary is determined only by a functional word (w_i) or the local POS context (C_i).</Paragraph>
    <Paragraph position="6"> In addition, for P(B), it is possible to use a simple bigram approximation:</Paragraph>
    <Paragraph position="8"> Therefore, a statistical model for the automatic prediction of constituent boundary is set up.</Paragraph>
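
Under these approximations, the best tag sequence B' can be searched with dynamic programming over the three tags. The sketch below is illustrative, not the paper's implementation: the probability functions are toy stand-ins for the emission term P(b_i | w_i or C_i) and the bigram term P(b_i | b_{i-1}), and the Viterbi-style search is an assumption about how the argmax is computed.

```python
import math

TAGS = ['L', 'M', 'R']

def best_boundary_sequence(words, emit_prob, trans_prob):
    """Viterbi-style search for the best boundary tag sequence B'.
    emit_prob(tag, word)  stands in for P(b_i | w_i or C_i);
    trans_prob(prev, tag) stands in for the bigram P(b_i | b_{i-1})."""
    # delta[t]: best log-score of any tag sequence ending in tag t
    delta = {t: math.log(emit_prob(t, words[0])) for t in TAGS}
    back = []
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: delta[p] + math.log(trans_prob(p, t)))
            new_delta[t] = (delta[prev] + math.log(trans_prob(prev, t))
                            + math.log(emit_prob(t, w)))
            ptr[t] = prev
        delta, back = new_delta, back + [ptr]
    # trace back the best path
    last = max(TAGS, key=lambda t: delta[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```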
    <Paragraph position="10"> The probability estimates of the model are based on the boundary distribution data (S1) described in section 2, and can be calculated through the maximum likelihood estimation (MLE) method. For example,</Paragraph>
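
A minimal sketch of such an MLE estimate, assuming the S1 data can be read as (word, boundary-tag) training pairs (the data format is a hypothetical stand-in):

```python
from collections import Counter

def mle_boundary_model(training_pairs):
    """MLE estimate P(b | w) = count(w, b) / count(w), computed from
    boundary-tagged (word, tag) pairs."""
    joint, marginal = Counter(), Counter()
    for word, tag in training_pairs:
        joint[(word, tag)] += 1
        marginal[word] += 1
    def prob(tag, word):
        # unseen words get probability 0 in this bare sketch
        return joint[(word, tag)] / marginal[word] if marginal[word] else 0.0
    return prob
```

In practice such raw relative frequencies would need smoothing for unseen events; the sketch omits that.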
    <Paragraph position="12"> There are two directions to improve the prediction model. First, many post-editing rules that are manually developed or automatically learned by an error-driven learning method can be used to refine the automatic prediction outputs \[ZQ96\]. Second, a new statistical model based on the forward-backward algorithm will produce multiple boundary predictions for a word in the sentence \[ZZ96\].</Paragraph>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.2 Basic matching model
</SectionTitle>
      <Paragraph position="0"> In order to build a complete syntactic tree based on the boundary prediction information, two basic problems must be resolved. The first is how to find the reasonable constituents in the partially bracketed sentence. The second is how to label the found constituents with suitable syntactic tags.</Paragraph>
      <Paragraph position="1"> This section will propose some basic concepts and operations of the matching model to deal with the first problem, and section 3.3.1 will give methods to resolve the second one. The formal description of the bracket matching model can be found in \[ZQd96\].</Paragraph>
      <Paragraph position="2">  operation SM(i,j) or the expanded matching operation EM(i,j).</Paragraph>
      <Paragraph position="3"> Therefore, a basic matching algorithm can be built as follows: starting from the preprocessed sentence S=&lt;W,T,B&gt;, we first use the simple matching operation, then the expanded matching operation, so as to find every possible matched constituent in the sentence. The complete matching principle guarantees that this algorithm will produce all matched constituents in the sentence. See \[ZQd96\] for more detailed information on this principle and its formal proof.</Paragraph>
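
The simple matching step can be sketched as follows. This is a simplified reading of SM(i,j), not the formal definition from \[ZQd96\]: a constituent is matched when word i carries tag L, word j carries tag R, and every word strictly between them carries M.

```python
def simple_matches(tags):
    """Find SM(i,j) spans in a boundary tag sequence: an L at i, an R
    at j, and only M tags strictly between them (simplified sketch)."""
    spans = []
    for i, t in enumerate(tags):
        if t != 'L':
            continue
        for j in range(i + 1, len(tags)):
            if tags[j] == 'R' and all(tags[k] == 'M' for k in range(i + 1, j)):
                spans.append((i, j))
    return spans
```

The expanded matching operation EM(i,j) would then grow these spans by absorbing adjacent words or constituents; it is omitted here since its formal definition lives in \[ZQd96\].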
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
3.3 Matching restriction schemes
</SectionTitle>
      <Paragraph position="0"> The basic matching algorithm based on the complete matching principle is inefficient, because many ungrammatical or unnecessary constituents can be produced by the two matching operations. In order to improve the efficiency of the algorithm, some matching restriction schemes are needed, which include: (1) labeling the matched constituents with reasonable syntactic tags, (2) setting matching restriction regions, (3) discarding unnecessary matching operations according to local preference information.</Paragraph>
      <Paragraph position="1">  The aim of the labeling approach is to eliminate the ungrammatical matched constituents and to label the reasonable constituents with suitable syntactic tags, according to their internal structure and external context information.</Paragraph>
      <Paragraph position="2"> First, some common erroneous constituent structures can be enumerated under current POS tagset and syntactic tagset. Moreover, many heuristic rules to find ungrammatical constituents can also be summarized according to constituent combination principles. Based on them, most ungrammatical constituents can be eliminated.</Paragraph>
      <Paragraph position="3"> Then, we can assign a suitable syntactic tag to each matched constituent through the following sequential processing steps:  (a) Set the syntactic tag according to the statistical reduction rule, if it can be found in the syntactic tag reduction data (S2) using the constituent structure string as a keyword. (b) Determine the syntactic tag according to the intersection of the tag distribution sets of the open and close brackets on the constituent boundary, if they can be found in the statistical data (S3). (c) Assign a special tag that is not in the current syntactic tagset to every constituent left unlabeled after the above two processing steps.</Paragraph>
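
Steps (a)-(c) can be sketched as a small decision cascade. The data shapes are assumptions, not the paper's: `s2_rules` maps a constituent structure string to a tag (standing in for S2), and `left_tags`/`right_tags` are the tag distribution sets found for the open and close brackets (standing in for S3).

```python
def label_constituent(structure, s2_rules, left_tags, right_tags):
    """Sequential labeling sketch of steps (a)-(c)."""
    # (a) statistical reduction rule keyed by the structure string
    if structure in s2_rules:
        return s2_rules[structure]
    # (b) intersection of the open/close bracket tag distribution sets
    common = left_tags & right_tags
    if common:
        return sorted(common)[0]  # deterministic pick for the sketch
    # (c) fall back to a special tag outside the current tagset
    return 'UNK'
```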
      <Paragraph position="4">  There are many regionally restricted constituents in natural language, such as reference constituents in a pair of quotation marks: &amp;quot;...&amp;quot;, and the regular collocation phrase &amp;quot;zai ... de shihou (when ...)&amp;quot; in Chinese. The constituents inside them cannot have syntactic relationships with the outside ones. In the bracket matching model, these cases can be generalized as a matching restriction region (MRR), which is informally represented as the region &lt;RL, RR&gt; in Figure 3.</Paragraph>
      <Paragraph position="5">  arcs marked with 'X' indicate that such matching operations are forbidden.</Paragraph>
      <Paragraph position="6"> Therefore, the basic matching algorithm can be improved by adding the following restrictions:  (a) Restrict the matching operations inside the MRR and guarantee that they cannot cross the boundary of the MRR.</Paragraph>
      <Paragraph position="7"> (b) Reduce the MRR to a constituent MC(RL,RR) after all matching operations inside the MRR have been finished, so as to make it act as a whole during the following matching operations.  The key to using MRRs efficiently is to correctly identify the possible restriction regions in the sentences. Reference \[ZQ96\] describes the automatic identification methods for some Chinese MRRs. 3.3.3 Local preference matching Consider the parsing state after the simple matching operation SM(i,j): \[t_i-1 MC(i,j) t_j+1\]. Starting from it, there are two possible expanded matching operations: EM(i-1,j) or EM(i,j+1). Both must be processed according to the basic matching algorithm, and two candidate matched constituents, MC(i-1,j) and MC(i,j+1), will be produced. But in many cases, one of these operations is unnecessary because only one candidate constituent may be included in the best parse tree. These superfluous matching operations reduce the parsing efficiency of the basic matching algorithm. Let &amp;quot;A B C&amp;quot; be the local matching context (for the above example, we have: A = t_i-1, B = MC(i,j), and C = t_j+1). P(B,C) is the right combination probability for constituent 'B' and P(A,B) is its left combination probability, which can be easily computed using the constituent preference data (S4) described in section 2. Set α as the difference threshold. Then, a simple preference-based approach can be added into the basic matching algorithm to improve the parsing efficiency: if P(B,C)-P(A,B)&gt;α, then the matching operation \[A,B\] will be discarded.</Paragraph>
      <Paragraph position="9"> if P(A,B)-P(B,C)&gt;α, then the matching operation \[B,C\] will be discarded.</Paragraph>
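
The two discard rules above amount to a simple three-way decision, sketched here (the function name and string results are illustrative, not from the paper):

```python
def prune_expansion(p_left, p_right, alpha):
    """Local preference rule for the two candidate expansions of MC(i,j):
    p_left stands in for P(A,B), p_right for P(B,C), alpha is the
    difference threshold."""
    if p_right - p_left > alpha:
        return 'discard [A,B]'   # right combination strongly preferred
    if p_left - p_right > alpha:
        return 'discard [B,C]'   # left combination strongly preferred
    return 'keep both'           # no clear preference: do both expansions
```

When neither difference exceeds α, both expanded matching operations are kept and disambiguation is left to the global score model of section 3.4.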
      <Paragraph position="10"> 3.4 Statistical disambiguation model This section describes the way the best syntactic tree is selected. A statistical approach to this problem is to use SCFG rules extracted from a treebank and to set a probability score scheme for disambiguation. Assume a constituent labeled with syntactic tag PH is composed of the syntactic components RP1,</Paragraph>
      <Paragraph position="12"> where the probability P(PH -&gt; RP1 RP2 ... RPn) comes from the statistical data (S5) defined in section 2. In addition, if RP_i is a word component, then set P(RP_i) = 1.</Paragraph>
      <Paragraph position="13"> By computing the logarithm on both sides of equation (7), we get the probability score Score(PH):</Paragraph>
      <Paragraph position="15"> Formally, a labeled constituent MC(1,n) may be regarded as a syntactic tree. Therefore, the most likely parse tree under this score model is the matched constituent with the maximum probability score, i.e. Tbest = argmax Score(MC(1,n)).</Paragraph>
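
The recursive score can be sketched directly from the scheme above: the log probability of a constituent's rule plus the scores of its components, with word leaves contributing log 1 = 0. The tree encoding and `rule_prob` table are assumptions standing in for the S5 data.

```python
import math

def score(tree, rule_prob):
    """Score(PH) = log P(PH -> RP1 ... RPn) + sum of the children's
    scores; a word leaf contributes 0 (P = 1). Trees are
    (label, children) pairs, words are plain strings."""
    if isinstance(tree, str):          # word component: log 1 = 0
        return 0.0
    label, children = tree
    # right-hand side: child labels for subtrees, words for leaves
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return math.log(rule_prob[(label, rhs)]) + sum(
        score(c, rule_prob) for c in children)
```

Tbest is then the matched constituent MC(1,n) maximizing this score over all candidate trees.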
    </Section>
  </Section>
</Paper>