<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1067">
  <Title>Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Previous Work on Word Segmentation
</SectionTitle>
    <Paragraph position="0"> Our method builds on two existing methods for Chinese and Japanese word segmentation, which we explain in this section.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Markov Model-Based Method
</SectionTitle>
      <Paragraph position="0"> Word-based Markov models are used in English part-of-speech (POS) tagging (Charniak et al., 1993; Brants, 2000). This method identifies POS tags T = t_1, ..., t_n, given a sentence as a word sequence W = w_1, ..., w_n, where n is the number of words in the sentence. The method assumes that each word has a state, which is the same as the POS of the word, and that the sequence of states is a Markov chain. A state t transits to another state s with probability P(s|t), and outputs a word w with probability P(w|t). Under these assumptions, the probability that the word sequence W is generated with parts-of-speech T is</Paragraph>
      <Paragraph position="1"> P(W, T) = Π_{i=1}^{n} P(t_i|t_{i-1}) P(w_i|t_i),   (1)</Paragraph>
      <Paragraph position="2"> where w_0 (t_0) is a special word (part-of-speech) representing the beginning of the sentence. Given a word sequence W, its most likely POS sequence ^T can be found as follows:</Paragraph>
      <Paragraph position="3"> ^T = argmax_T P(T|W) = argmax_T P(W, T).   (2)</Paragraph>
      <Paragraph position="4"> The equation above can be solved efficiently by the Viterbi algorithm (Rabiner and Juang, 1993).</Paragraph>
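The Viterbi decoding described above can be sketched as follows. The tag set, transition probabilities P(s|t), and output probabilities P(w|t) below are toy values invented for the example, not taken from the paper:

```python
import math

# Toy model: two tags, a start state "<s>", and made-up probabilities.
trans = {  # P(s|t): probability that state t transits to state s
    ("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
    ("N", "N"): 0.3, ("N", "V"): 0.7,
    ("V", "N"): 0.8, ("V", "V"): 0.2,
}
emit = {  # P(w|t): probability that state t outputs word w
    ("N", "dogs"): 0.5, ("N", "bark"): 0.1,
    ("V", "bark"): 0.6, ("V", "dogs"): 0.05,
}
tags = ["N", "V"]

def viterbi(words):
    # delta[t]: best log-probability of any tag path ending in tag t
    delta = {t: math.log(trans.get(("<s>", t), 1e-12))
                + math.log(emit.get((t, words[0]), 1e-12)) for t in tags}
    back = []  # backpointers, one dict per position after the first
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p]
                            + math.log(trans.get((p, t), 1e-12)))
            new_delta[t] = (delta[best_prev]
                            + math.log(trans.get((best_prev, t), 1e-12))
                            + math.log(emit.get((t, w), 1e-12)))
            ptr[t] = best_prev
        delta, back = new_delta, back + [ptr]
    # Recover the best tag sequence by following backpointers.
    t = max(tags, key=lambda x: delta[x])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return list(reversed(path))

print(viterbi(["dogs", "bark"]))  # → ['N', 'V']
```

The dynamic program keeps, for each position and tag, only the best-scoring path, so decoding is linear in sentence length rather than exponential.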
      <Paragraph position="5"> In Chinese and Japanese, the method is used with some modifications. Because the words in a sentence are not explicitly separated in Chinese and Japanese, segmentation of words and identification of their POS tags must be done simultaneously. Given a sentence S, its most likely word sequence ^W and POS sequence ^T can be found as follows, where W ranges over the possible segmentations of S (w_1 ... w_n = S):</Paragraph>
      <Paragraph position="6"> (^W, ^T) = argmax_{W,T} P(W, T).   (3)</Paragraph>
      <Paragraph position="7"> The equation above can be solved using the Viterbi algorithm as well.</Paragraph>
      <Paragraph position="8"> The possible segmentations of a given sentence are represented by a lattice, and Figure 1 shows an example. Given a sentence, this method first constructs such a lattice using a word dictionary, then chooses the best path, which maximizes Equation (3).</Paragraph>
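A minimal sketch of this lattice search, simplified to unigram word probabilities instead of the word/POS bigrams of Equation (3); the dictionary and probabilities are toy values invented for the example, not taken from the paper:

```python
import math

# Toy dictionary with made-up unigram word probabilities.
word_prob = {"東京": 0.004, "京都": 0.003, "東京都": 0.002,
             "都": 0.001, "に": 0.05, "住む": 0.001}

def segment(sentence):
    """Return the highest-probability segmentation of `sentence`."""
    n = len(sentence)
    # best[i]: (log-probability, start of last word) for the best
    # segmentation of sentence[:i]; None means position i is unreachable.
    best = [None] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(n):
        if best[i] is None:
            continue
        for j in range(i + 1, n + 1):
            w = sentence[i:j]
            if w in word_prob:  # lattice edge: a dictionary word spanning i..j
                score = best[i][0] + math.log(word_prob[w])
                if best[j] is None or score > best[j][0]:
                    best[j] = (score, i)
    # Recover the word sequence by following backpointers from the end.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return list(reversed(words))

print(segment("東京都に住む"))  # → ['東京都', 'に', '住む']
```

Only substrings found in the dictionary become lattice edges, which is exactly why unknown words need the separate handling discussed next.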
      <Paragraph position="9"> This Markov model-based method achieves high accuracy with low computational cost, and many Japanese word segmentation systems adopt it (Kurohashi and Nagao, 1998; Matsumoto et al., 2001). However, it has difficulty handling unknown words: when the lattice is constructed, only known words are included, and unknown words must be handled by other means. Many practical word segmentation systems therefore add unknown word candidates to the lattice. Such candidates can be generated by heuristic rules (Matsumoto et al., 2001) or by statistical word models that predict the probability of any string being an unknown word (Sproat et al., 1996; Nagata, 1999). However, these heuristic rules or word models must be carefully designed for a specific language, and it is difficult to properly process a wide variety of unknown words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Character Tagging Method
</SectionTitle>
      <Paragraph position="0"> This method carries out word segmentation by tagging each character in a given sentence, where the tags indicate the word-internal positions of the characters. We call such tags position-of-character (POC) tags (Xue, 2003) in this paper. Several POC-tag sets have been studied (Sang and Veenstra, 1999; Sekine et al., 1998), and we use the 'B, I, E, S' tag set shown in Table 1.</Paragraph>
      <Paragraph position="1"> Figure 2 shows an example of POC-tagging. POC-tags can represent the word boundaries of any sentence, so the word segmentation task can be reformulated as a POC-tagging task. This tagging task can be solved with general machine learning techniques such as maximum entropy (ME) models (Xue, 2003) and support vector machines (Yoshida et al., 2003; Asahara et al., 2003).</Paragraph>
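The reformulation can be sketched as a pair of conversion routines between a segmented sentence and its 'B, I, E, S' character tags (here B = beginning of a word, I = word-internal, E = end of a word, S = single-character word; the example sentence is invented, not taken from the paper's figures):

```python
def words_to_tags(words):
    """Convert a segmented sentence (list of words) to per-character POC tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                 # single-character word
        else:
            tags.append("B")                 # first character of the word
            tags.extend("I" * (len(w) - 2))  # word-internal characters
            tags.append("E")                 # last character of the word
    return tags

def tags_to_words(chars, tags):
    """Recover the word sequence from a character string and its POC tags."""
    words, cur = [], ""
    for c, t in zip(chars, tags):
        cur += c
        if t in ("E", "S"):  # a word boundary follows E and S tags
            words.append(cur)
            cur = ""
    return words

tags = words_to_tags(["東京都", "に", "住む"])
print(tags)                               # → ['B', 'I', 'E', 'S', 'B', 'E']
print(tags_to_words("東京都に住む", tags))  # → ['東京都', 'に', '住む']
```

Because every character receives a tag regardless of whether it belongs to a dictionary word, this encoding is what lets the tagging method treat known and unknown words uniformly, as the next paragraph notes.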
      <Paragraph position="2"> This character tagging method can easily handle unknown words, because known and unknown words are treated identically and no exceptional processing is necessary. Besides word segmentation, this approach has also been used for base-NP chunking (Ramshaw and Marcus, 1995) and named entity recognition (Sekine et al., 1998).</Paragraph>
    </Section>
  </Section>
</Paper>