<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3020"> <Title>Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff</Title> <Section position="2" start_page="0" end_page="142" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In Chinese natural language processing (NLP), a high-performance Chinese word segmentor (CWS) is a useful pre-processing stage that produces an intermediate result for later processes such as search engines, text mining and speech recognition. The bottleneck in developing a high-performance CWS is building a high-performance Chinese unknown word identifier (UWI) (Lin et al. 1993; Tsai et al. 2003). This is because Chinese is written without any separation between words, and more than 50% of the words in the Chinese texts of web corpora are out-of-vocabulary (Tsai et al. 2003).</Paragraph> <Paragraph position="1"> Conventionally, there are four approaches to developing a CWS: (1) the dictionary-based approach (Cheng et al. 1999), especially forward and backward maximum matching (Wong and Chan, 1996); (2) the linguistic approach based on syntactic-semantic knowledge (Chen et al. 2002); (3) the statistical approach based on a statistical language model (SLM) (Sproat and Shih, 1990; Teahan et al. 2000; Gao et al. 2003); and (4) the hybrid approach, which tries to combine the benefits of the dictionary-based, linguistic and statistical approaches (Tsai et al. 2003; Ma and Chen, 2003). In practice, statistical approaches are the most widely used because of their effective and reasonable performance. For a CWS, there are two types of word segmentation ambiguity when the text contains no unknown words: (1) Overlap ambiguity (OA). Take a character string ABC as an example. 
If its segmentation can be either AB/C or A/BC depending on the context, ABC is called an overlap ambiguity string (OAS), such as &quot;將軍(a general)/用(use)&quot; and &quot;將(to get)/軍用(for military use)&quot; (the symbol &quot;/&quot; indicates a word boundary); (2) Combination ambiguity (CA). Take a character string AB as an example. If its segmentation can be either A/B or AB depending on the context, AB is called a combination ambiguity string (CAS), such as &quot;才(just)/能(can)&quot; and &quot;才能 (ability).&quot; Meanwhile, there are two types of segmentation error caused by the unknown word problem: (1) Lack of unknown word (LUW), a segmentation error that occurs because an unknown word is missing from the system dictionary, such as &quot;D/Y'/at5W&quot;; (2) Error identified word (EIW), a segmentation error caused by an incorrectly identified unknown word, such as &quot;!# @a.&quot; To sum up, in most cases the UWI of a CWS is a pre-processing stage that detects unknown words so as to optimize the LUW-EIW tradeoff, after which the auto-detected OAS and CAS problems in the segmentation results are disambiguated.</Paragraph> <Paragraph position="2"> The goal of this paper is to illustrate and report the effectiveness and the scored results of our BMM-based CWS in the MSR closed (MSR_C) track of the Second International Chinese Word Segmentation Bakeoff. For this Bakeoff, our CWS focuses mainly on optimizing the LUW-EIW tradeoff.</Paragraph> <Paragraph position="3"> The remainder of this paper is arranged as follows. In Section 2, we present the details of our BMM-based CWS, which incorporates a context-based UWI. In Section 3, we present the scored results of the CWS in the MSR_C track and give our analysis. Finally, in Section 4, we give our conclusions and suggest some future research directions.</Paragraph> </Section> </Paper>