<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1211">
  <Title>Statistics Based Hybrid Approach to Chinese Base Phrase Identification</Title>
  <Section position="3" start_page="0" end_page="74" type="metho">
    <SectionTitle>
2 Statistics Based Hybrid Approach to Chinese Base Phrase Identification
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="73" type="sub_section">
      <SectionTitle>
2.1 Concepts and Definitions
</SectionTitle>
      <Paragraph position="0"> In addition to BNP, constituents of many local structure in Chinese centers around a core word with certain fixed POS sequences. Therefore their identification is slightly different from parsing in that it bears relatively simple phenomenon. Like BNP identification, identification of these phenomena before parsing will provide a simpler sequence for parser, and thus deserves a separate research.</Paragraph>
      <Paragraph position="1"> CutTenfly, we are considering 7 Chinese base phrases in our research, namely base adjective phrase(BADJP), base adverbial phrase (BADVP), base noun phrase (BNP),  base temporal phrase (BTN), base location phrase (BNS), base verb phrase (BVP) and base quantity phrase (BMP) Though theoretically definitions for these base phrases are still unavailable, Appendix I lists the preliminary illustrations for them in BNF format (necessary account for POS annotation can also be found)..</Paragraph>
      <Paragraph position="2"> To frame the identification of Chinese base phrases, we fm'ther develop the following concepts: Definition 1: Chinese based phrases are recognized as atomic parts of a sentence beyond words that posses certain functions and meanings. A base phrase may consist of words or other base phrases, but its constituents, in turn, should not contain any base phrases.</Paragraph>
      <Paragraph position="3"> Definition 2: Base phrase tag is the token representing the syntactic function of the phrase. At present, base tag either falls in one of the 7 Chinese base phrases we are considering or not:</Paragraph>
      <Paragraph position="5"> Definition 3: Boundary tag denotes the possible relative position of a word to a base phrase. A boundary tag for a gfven word is either L( left boundary of a base phrase), R( right boundary of a ), I(inside a base phrase) or O(outside the base phrase).</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
2.2 Duple Based HMM Parser
</SectionTitle>
      <Paragraph position="0"> Based on above definitions, we could, in view of Wojciech's proposal \[Wojeieeh and Thorsten, 1998\], interpret the parsing of Chinese base phrases as the following: Suppose the input as a sequence of POS annotations T= (to, ....... t,,). The task is to find RC, a most possible sequence of duples formed by base phrase tags and boundary tags, among the POS sequence T.</Paragraph>
      <Paragraph position="2"> in whil~h ri (l &lt;i&lt; =n )indicates the boundary tags, ci represents the base phrase tags.</Paragraph>
      <Paragraph position="3"> To go along with the POS tagger developed previously by us, we first think of preserving HMM (hidden Markov Model) for parsing Chinese base phrases. Thus the following formula is usually*at hand:</Paragraph>
      <Paragraph position="5"> For a given sequence of T, this formula can be transformed into:</Paragraph>
      <Paragraph position="7"> Essentially this model could be established through bigram or tri-gram statistical training by a annotated corpus. In practice, we just build our model from l O, O00 manual annotated sentences with common bi-gram training:</Paragraph>
      <Paragraph position="9"> In realization, a Viterbi algorithm is adopted to search the best path. An open test on additional 1000 sentences is performed to check its accuracy. Results are shown in</Paragraph>
    </Section>
    <Section position="3" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
2.3 Triple Based MM Exploiting
Linguistic Information
</SectionTitle>
      <Paragraph position="0"> Although results shown in Table 1 is encouraging enough for research purposes, it is still lies a long way for practical Chinese parser we are aiming at. Reasons for errors may be account by too coarse-grained information provided by RC.</Paragraph>
      <Paragraph position="1"> Observing the fact that the Chinese base phrase occurs more frequently with some fixed patterns, i.e. some frozen POS chains, we decide to improved our previous model by emphasizing the contribution given by POS information.</Paragraph>
      <Paragraph position="2"> Adding t denoting POS in the duple (r,  c), we develop a triple in the form of (t,r,e) for the calculation of a node. Naturally, the new model is changed into a MM (Markov model) as:</Paragraph>
      <Paragraph position="4"> To train this model, we still using a bi-gram model. Applying the same corpus and tests described above, we got the performance of triple based MM identifier for Chinese base phrases (see Table 2).</Paragraph>
    </Section>
    <Section position="4" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
2.4 Further Improvement Through TBED
Learning
</SectionTitle>
      <Paragraph position="0"> Like other statistical models, the above model, whether duple based or triple based, both seem to reach an accuracy ceiling after enlarging training set to 12, 000 or so. To cover the remaining accuracy, we apply the transformation-based error driven (TBED) learning strategy described in \[Brill, 1992\] to acquired desired rules.</Paragraph>
      <Paragraph position="1"> In our module, some initial rules are first designed as compensation of statistical model. Applying these rules will cause new mistakes as well as make correct identifications. Then the module will compare the processed texts with training sentences, generate new rules according to pre-defmed actions and update its rule bank after evaluation (see Fig 1.).</Paragraph>
      <Paragraph position="2">  The dotted line in fig 2. will stop functioning if pre-set accuracy is reached by the identifier for the Chinese base phrase. Evaluation of new rules is based on an greedy algorithm: only rule with max contribution (max correction and rain error) will be added. Design of rule generation (pre-defined actions) is similar to those described in \[Brill, 1992\].</Paragraph>
      <Paragraph position="3"> Table 3 shows a significant improvement after applying rules obtained through TBED learner. It is also the final performance of the proposed Chinese base phrase identification model.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML