<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3020">
  <Title>Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff</Title>
  <Section position="3" start_page="142" end_page="143" type="metho">
    <SectionTitle>
2 Development of BMM-based CWS
</SectionTitle>
    <Paragraph position="0"> As reported in (Tsai et al. 2004), the Chinese word segmentation performance of the BMM technique is about 1% higher than that of the FMM technique.</Paragraph>
    <Paragraph position="1"> Thus, we adopt the BMM technique as the base of our CWS. The symbols used in our CWS are defined as follows: &lt;BOS&gt;: begin of sentence; &lt;EOS&gt;: end of sentence; &lt;BOW&gt;: begin of word; &lt;EOW&gt;: end of word; /: word boundary; +: inner word boundary in the segmentation of a system word segmented by the BMM technique with the system dictionary exclusive of that system word; SWS (stop word string): a system word (such as &amp;quot;F,(of)&amp;quot;) is a SWS if the ratio (its non-SWS probability) of the total frequency of the other system words containing it (such as &amp;quot;O6F,(beautiful)&amp;quot;) to its character-string frequency is less than or equal to 1%; SWBS (stop word bigram string): a word bigram (such as &amp;quot;1(just)/P(can)&amp;quot;) is a SWBS if the ratio (its non-SWBS probability) of the frequency of its character string as a word (such as &amp;quot;1P(ability)&amp;quot;) to its character-string frequency is less than or equal to 1%; BMM-ASM (BMM ambiguity string mapping) table: the BMM-ASM table lists all pairs of a correct SS (given in the training corpus) and the corresponding erroneous BMM SS (generated by BMM with the training system dictionary). Take the Chinese sentence &amp;quot;47DF)%&amp;quot; as an example: as per its MSR-standard segmentation &amp;quot;47D(effect)/F(really)/)%(good)&amp;quot; and its BMM segmentation &amp;quot;4(follow)/7DF(indeed)/)%(good),&amp;quot; the pair &amp;quot;47D/F&amp;quot;&amp;quot;4/7DF&amp;quot; is a BMM-ASM; TCT (triple context template): a TCT comprises three items, from left to right: the left word, the segmented system word, and the right word, where the system word is not a mono-syllabic Chinese word. Take the Chinese sentence &amp;quot;47D/F/)%&amp;quot; as an example. The two generated TCTs are:</Paragraph>
    <Paragraph position="3"> WCT (word context template): a WCT comprises three items, from left to right: &amp;quot;&lt;BOW&gt;&amp;quot;, the segmented system word, and &amp;quot;&lt;EOW&gt;&amp;quot;, where the system word is not a mono-syllabic word. Take the system word &amp;quot;%/%+(lamasery)&amp;quot; as an example. Its two generated WCTs are:</Paragraph>
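The SWS test defined above can be sketched as follows; this is a minimal illustration in which the function names, the romanized stand-in words, and all counts are our own toy values, not bakeoff statistics:

```python
# Minimal sketch of the SWS (stop word string) test. A string s is a SWS
# when its "non-SWS probability" -- the total frequency of the other
# system words containing s, divided by the raw character-string
# frequency of s -- is at most the 1% threshold.

def non_sws_probability(s, word_freq, string_freq):
    """word_freq: corpus frequency of each system word (as a word);
    string_freq: raw frequency of each character string in the corpus."""
    containing = sum(f for w, f in word_freq.items() if s in w and w != s)
    return containing / string_freq[s]

def is_sws(s, word_freq, string_freq, threshold=0.01):
    return non_sws_probability(s, word_freq, string_freq) <= threshold

word_freq = {"de": 50, "meilide": 3}   # romanized stand-ins for Chinese words
string_freq = {"de": 10000}            # "de" occurs 10000 times as a string
print(is_sws("de", word_freq, string_freq))   # True: 3/10000 <= 1%
```

The SWBS test follows the same ratio-and-threshold pattern, with the bigram's character string (taken as one word) supplying the numerator.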
    <Paragraph position="5"> The algorithm of our BMM-based CWS, which incorporates a context-based UWI, is as follows: Step 1. First, generate the BMM segmentation of the input Chinese sentence with the system dictionary, which comprises all word types found in the training corpus. Then, use the BMM-ASM table to revise any matched BMM ambiguity strings.</Paragraph>
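For reference, backward maximum matching itself can be sketched as below; the dictionary contents, the `max_len` limit, and the function name are illustrative choices of ours:

```python
# Sketch of plain backward maximum matching (BMM): scan from the end of
# the sentence, always taking the longest dictionary word that ends at
# the current position, falling back to a single character.

def bmm_segment(sentence, dictionary, max_len=4):
    words = []
    i = len(sentence)
    while i > 0:
        for n in range(min(max_len, i), 0, -1):
            cand = sentence[i - n:i]
            if n == 1 or cand in dictionary:   # 1-char fallback always fires
                words.append(cand)
                i -= n
                break
    words.reverse()
    return words

# FMM would read "ab" off the front; BMM prefers the longer match at the end.
print(bmm_segment("abcd", {"ab", "bcd"}))   # ['a', 'bcd']
```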
    <Paragraph position="6"> Step 2. First, use the UWI to identify unknown words in the segmentation of Step 1 with the TCT knowledge. For each matched TCT, the characters between the left word and the right word are combined into a UWI-identified word; however, if the candidate word includes a SWS or a SWBS, it is rejected. Then, repeat the Step 1 process with the system dictionary of Step 1 augmented with the UWI-identified words of this step.</Paragraph>
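The TCT-driven merge of Step 2 can be sketched as below, under the simplifying assumption that the TCT knowledge reduces to a set of (left word, right word) context pairs; the function names and romanized tokens are ours:

```python
# Sketch of the Step 2 TCT merge: when a learned (left word, right word)
# context surrounds a span of segments, the span is joined into one
# UWI-identified word, unless it contains a SWS/SWBS.

def apply_tct(segments, tct_contexts, stop_strings):
    out, i, n = [], 0, len(segments)
    while i < n:
        merged = False
        for j in range(i + 2, n):
            if (segments[i], segments[j]) in tct_contexts:
                candidate = "".join(segments[i + 1:j])
                # a candidate containing a SWS or SWBS is rejected
                if not any(s in candidate for s in stop_strings):
                    out.extend([segments[i], candidate])
                    i, merged = j, True
                break   # stop at the nearest matching right word
        if not merged:
            out.append(segments[i])
            i += 1
    return out

print(apply_tct(["effect", "re", "al", "ly", "good"],
                {("effect", "good")}, set()))   # ['effect', 'really', 'good']
```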
    <Paragraph position="7"> Step 3. First, add the tags &amp;quot;&lt;BOW&gt;&amp;quot; and &amp;quot;&lt;EOW&gt;&amp;quot; at the left and right sides of each run of contiguous single-character segmentations from Step 2. Then, use the UWI to identify unknown words with the WCT knowledge: if the number of characters between &amp;quot;&lt;BOW&gt;&amp;quot; and &amp;quot;&lt;EOW&gt;&amp;quot; matches that of the matched WCT, these single characters are combined into a UWI-identified word; if the candidate word includes a SWS or a SWBS, it is rejected. Finally, repeat the Step 1 process with the system dictionary of Step 2 augmented with the UWI-identified words of this step.</Paragraph>
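The WCT-based merge of Step 3 can be sketched in the same spirit, under the simplifying assumption that the WCT knowledge reduces to the set of system words licensed by learned templates; names and tokens are ours:

```python
# Sketch of the Step 3 WCT merge: runs of single-character segments
# (the spans bracketed by <BOW>/<EOW> in the paper) are joined into one
# UWI-identified word when the joined span spells a WCT system word.

def apply_wct(segments, wct_words, stop_strings):
    out, run = [], []

    def flush():
        word = "".join(run)
        # merge only when the span matches a WCT word and has no SWS/SWBS
        if len(run) > 1 and word in wct_words and \
                not any(s in word for s in stop_strings):
            out.append(word)          # UWI-identified word
        else:
            out.extend(run)
        run.clear()

    for seg in segments:
        if len(seg) == 1:             # part of a single-character run
            run.append(seg)
        else:
            flush()
            out.append(seg)
    flush()
    return out

print(apply_wct(["in", "l", "a", "m", "asery"], {"lam"}, set()))
# ['in', 'lam', 'asery']
```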
    <Paragraph position="8"> Step 4. Use the UWI to combine a word bigram into one word under the following two conditions: (1) the non-SWS probability of the rightmost character of the left-side word is greater than or equal to 99%, and (2) the non-SWS probability of the leftmost character of the right-side word is greater than or equal to 99%. Take the word bigram &amp;quot;ppp/p&amp;quot; as an example.</Paragraph>
    <Paragraph position="9"> Since the non-SWS probability of the rightmost character &amp;quot;p&amp;quot; of the left-side word &amp;quot;ppp&amp;quot; is 99.95%, &amp;quot;pppp&amp;quot; is identified as a UWI-identified word. If a UWI-identified word includes a SWS or a SWBS, it is rejected. Finally, repeat the Step 1 process with the system dictionary of Step 3 augmented with the UWI-identified words of this step.</Paragraph>
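The Step 4 bigram test can be sketched as below, assuming a helper that returns the non-SWS probability of a single character; the helper, the tokens, and the probabilities are illustrative stand-ins of ours:

```python
# Sketch of the Step 4 rule: a word bigram (left, right) is combined
# when the rightmost character of the left word and the leftmost
# character of the right word both have non-SWS probability >= 99%.

def combine_bigram(left, right, non_sws_prob, stop_strings, threshold=0.99):
    if non_sws_prob(left[-1]) >= threshold and \
            non_sws_prob(right[0]) >= threshold:
        word = left + right
        # a combined word containing a SWS or SWBS is rejected
        if not any(s in word for s in stop_strings):
            return word            # UWI-identified word
    return None                    # keep the bigram as two words

probs = {"a": 0.9995, "c": 0.999, "b": 0.98}
lookup = lambda ch: probs.get(ch, 0.0)
print(combine_bigram("xxa", "cyy", lookup, set()))   # xxacyy
print(combine_bigram("xxa", "byy", lookup, set()))   # None
```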
    <Paragraph position="10"> Step 5. Repeat the Step 2 process.</Paragraph>
    <Paragraph position="11"> Step 6. Repeat the Step 3 process.</Paragraph>
    <Paragraph position="12"> Step 7. Repeat the Step 4 process.</Paragraph>
    <Paragraph position="13"> Step 8. Stop.</Paragraph>
    <Paragraph position="14"> In the above algorithm, Steps 2, 3, and 4 are repeated as Steps 5, 6, and 7, respectively, to demonstrate the recursive effect of our CWS.</Paragraph>
  </Section>
  <Section position="4" start_page="143" end_page="144" type="metho">
    <SectionTitle>
3 The Scored Results and Analysis
</SectionTitle>
    <Paragraph position="0"> In the 2nd International Chinese Word Segmentation Bakeoff, there are four training corpora: AS (Academia Sinica) and CU (City University of Hong Kong) are traditional Chinese corpora, while PU (Peking University) and MSR (Microsoft Research) are simplified Chinese corpora. The bakeoff has two testing tracks: closed and open. We participated in the MSR_C (MSR closed) track. The non-SWS and non-SWBS probability thresholds of our CWS for this bakeoff are both set to 1%, and the segmentation results of each step of our CWS are collected and scored separately.</Paragraph>
    <Section position="1" start_page="143" end_page="143" type="sub_section">
      <SectionTitle>
3.1 The Scored Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the details of the MSR training and testing corpora. Note that, in Table 1, the details of the MSR testing corpus were computed by us according to the MSR gold testing corpus. Table 2 shows the scored results of our CWS in the MSR_C track. The performance of &amp;quot;Step 1(P)&amp;quot; in Table 2 was computed by us; the others were taken from the official scored results. It shows that a very high performance of 99.1% F-measure can be achieved by the BMM-based CWS at Step 1 when the system dictionary comprises the word types found in both the MSR training and testing corpora (&amp;quot;P&amp;quot; means &amp;quot;Perfect&amp;quot;).</Paragraph>
    </Section>
    <Section position="2" start_page="143" end_page="144" type="sub_section">
      <SectionTitle>
3.2 The Analysis
</SectionTitle>
      <Paragraph position="0"> Table 3 (see next page) shows the differences in F-measure and ROOV between consecutive steps of our CWS. Table 3 indicates that the largest contribution to the overall performance (F-measure) of our CWS comes from Step 3, which uses the WCT knowledge.</Paragraph>
      <Paragraph position="1"> Table 4 (see next page) shows the distributions of four segmentation error types (OAS, CAS, LUW and EIW) for each step of our CWS.</Paragraph>
      <Paragraph position="2"> Table 4 shows that our context-based UWI with the knowledge of TCT and WCT can effectively optimize the LUW-EIW tradeoff. Moreover, Table 4 also shows that the knowledge of SWS, SWBS, and BMM-ASM can effectively resolve the CAS errors.</Paragraph>
    </Section>
  </Section>
</Paper>