<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2123">
  <Title>Segmentation</Title>
  <Section position="2" start_page="0" end_page="961" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many approaches to Chinese word segmentation have been proposed in the past decades. Segmentation performance has improved significantly, from the earliest maximal match (dictionary-based) approaches to HMM-based approaches (Zhang et al., 2003) and recent state-of-the-art machine learning approaches such as maximum entropy (MaxEnt) (Xue and Shen, 2003), support vector machines (SVM) (Kudo and Matsumoto, 2001), conditional random fields (CRF) (Peng and McCallum, 2004), and minimum error rate training (Gao et al., 2004).</Paragraph>
    <Paragraph position="1"> Footnote: The second author is now affiliated with NTT.</Paragraph>
    <Paragraph position="2"> By analyzing the top results in the first and second Bakeoffs (Sproat and Emerson, 2003; Emerson, 2005), we found that the top results were produced by direct or indirect use of so-called &quot;IOB&quot; tagging, which converts word segmentation into a character tagging problem so that part-of-speech tagging approaches can be applied to it. This approach is also called &quot;LMR&quot; (Xue and Shen, 2003) or &quot;BIES&quot; (Asahara et al., 2005) tagging. Under this scheme, each character is labeled &quot;B&quot; if it is the first character of a multiple-character word, &quot;I&quot; if it is any other character of a multiple-character word, and &quot;O&quot; if it forms a word by itself.</Paragraph>
    <Paragraph position="3"> For example, &quot;全(whole) 北京市(Beijing city)&quot; is labeled as &quot;全/O 北/B 京/I 市/I&quot;. Thus, the training data in word sequences are turned into IOB-labeled data in character sequences, which are then used as the training data for tagging. For new test data, word boundaries are determined from the tagging results.</Paragraph>
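    <Paragraph> The conversion between word sequences and IOB-labeled character sequences can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names words_to_iob and iob_to_words are hypothetical, the example characters are assumed from the glosses (whole, Beijing city), and the decoder simply attaches a stray "I" to the preceding unit.

```python
from typing import List, Tuple


def words_to_iob(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) pairs."""
    tagged = []
    for word in words:
        if len(word) == 1:
            # A character that forms a word by itself.
            tagged.append((word, "O"))
        else:
            # First character of a multiple-character word, then the rest.
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged


def iob_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Recover word boundaries from (character, tag) pairs."""
    words: List[str] = []
    for ch, tag in tagged:
        if tag == "I" and words:
            # Continue the word currently being built (assumption: a stray
            # "I" after "O" is attached to the previous unit).
            words[-1] += ch
        else:
            # "B" and "O" both start a new word.
            words.append(ch)
    return words


pairs = words_to_iob(["全", "北京市"])
print(pairs)
print(iob_to_words(pairs))
```

    Applying iob_to_words to the output of a tagger on new text yields the segmentation, which mirrors the last step described above.</Paragraph>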
    <Paragraph position="4"> While the IOB tagging approach has been widely used in Chinese word segmentation, we found that all existing implementations so far use character-based IOB tagging. In this work we propose subword-based IOB tagging, which assigns tags to units from a pre-defined lexicon subset consisting of the most frequent multiple-character words in addition to single Chinese characters. If only Chinese characters are used, subword-based IOB tagging reduces to character-based tagging. Taking the same example as above, &quot;全 北京市&quot; is labeled as &quot;全/O 北京/B 市/I&quot; in the subword-based tagging, where &quot;北京/B&quot; is labeled as one unit. We will give a detailed description of this approach in Section 2.</Paragraph>
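    <Paragraph> The subword-based labeling can be illustrated with a short sketch. It is only an illustration under assumptions of our own choosing: the helper names are hypothetical, and each word is split into subwords by greedy longest match against the lexicon, which this section does not specify (the actual procedure is the one given in Section 2).

```python
from typing import List, Set, Tuple


def split_into_subwords(word: str, lexicon: Set[str]) -> List[str]:
    """Split one word into lexicon subwords, falling back to single characters.

    Assumption: greedy longest match from left to right.
    """
    max_len = max([len(entry) for entry in lexicon] + [1])
    units: List[str] = []
    rest = word
    while rest:
        # Try the longest candidate first; a single character always matches.
        for n in range(min(max_len, len(rest)), 0, -1):
            piece = rest[:n]
            if n == 1 or piece in lexicon:
                units.append(piece)
                rest = rest[n:]
                break
    return units


def words_to_subword_iob(words: List[str], lexicon: Set[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (subword, tag) pairs."""
    tagged = []
    for word in words:
        units = split_into_subwords(word, lexicon)
        if len(units) == 1:
            tagged.append((units[0], "O"))
        else:
            tagged.append((units[0], "B"))
            tagged.extend((unit, "I") for unit in units[1:])
    return tagged


print(words_to_subword_iob(["全", "北京市"], {"北京"}))
print(words_to_subword_iob(["全", "北京市"], set()))
```

    With an empty lexicon every unit is a single character, so the second call produces the character-based labeling.</Paragraph>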
    <Paragraph position="5"> There exists a clear weakness with the IOB tagging approach: it yields a very low in-vocabulary (IV) rate (R-iv) in return for a higher out-of-vocabulary (OOV) rate (R-oov). In the results of the closed test in Bakeoff 2005 (Emerson, 2005), the work of Tseng et al. (2005), which used CRFs for IOB tagging, yielded a very high R-oov on all four corpora used, but its R-iv rates were lower. While OOV recognition is very important in word segmentation, a higher IV rate is also desired. In this work we propose a confidence measure approach to lessen this weakness. With this approach we can adjust R-oov and R-iv and find an optimal tradeoff. This approach will be described in Section 2.3.</Paragraph>
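    <Paragraph> To make the two rates concrete, the sketch below computes them for one sentence. It assumes the usual Bakeoff reading of R-iv and R-oov as recall over gold-standard words that are, respectively, inside and outside the training vocabulary; the function names and the toy vocabulary are hypothetical, and this is not the official scoring script.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character offsets (start, end) of each word in a segmentation."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def iv_oov_recall(gold: List[str], pred: List[str], vocab: Set[str]) -> Tuple[float, float]:
    """Return (R-iv, R-oov) for one gold/predicted segmentation pair."""
    pred_spans = word_spans(pred)
    hits = {"iv": 0, "oov": 0}
    totals = {"iv": 0, "oov": 0}
    start = 0
    for word in gold:
        kind = "iv" if word in vocab else "oov"
        totals[kind] += 1
        # A gold word is recalled only if the same span appears in the prediction.
        if (start, start + len(word)) in pred_spans:
            hits[kind] += 1
        start += len(word)
    r_iv = hits["iv"] / totals["iv"] if totals["iv"] else 0.0
    r_oov = hits["oov"] / totals["oov"] if totals["oov"] else 0.0
    return r_iv, r_oov


# Toy case: the OOV word is split incorrectly, the IV word is found.
print(iv_oov_recall(["全", "北京市"], ["全", "北京", "市"], {"全"}))
```

    A segmenter that splits aggressively tends to lose known words while one that relies on the lexicon tends to miss new ones, which is the tradeoff between the two rates discussed above.</Paragraph>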
    <Paragraph position="6"> In addition, we illustrate our word segmentation process in Section 2, where subword-based tagging is described using the MaxEnt method. Section 3 presents our experimental results, including the effects of using MaxEnt and CRFs.</Paragraph>
    <Paragraph position="7"> Section 4 describes current state-of-the-art methods for Chinese word segmentation, with which our results are compared. Section 5 provides concluding remarks and outlines future goals.</Paragraph>
  </Section>
</Paper>