<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3026">
<Title>Description of the HKU Chinese Word Segmentation System for Sighan Bakeoff 2005</Title>
<Section position="3" start_page="0" end_page="165" type="metho">
<SectionTitle> 2 Overview of the System </SectionTitle>
<Paragraph position="0"> In practice, our system works in two major steps. The first step is a process of known word segmentation, which aims to segment an input sequence of Chinese characters into a sequence of known words that are listed in the system dictionary. In our current system, we apply a known word bigram model to perform this task (Fu and Luke, 2003; Fu and Luke, 2004; Fu and Luke, 2005).</Paragraph>
<Paragraph position="1"> Known word segmentation is essentially a process of disambiguation. Given a Chinese character string $C = c_1 c_2 \cdots c_n$, there are usually multiple possible segmentations into known words $W = w_1 w_2 \cdots w_m$ according to the system dictionary. The task of known word segmentation is to find the segmentation $\hat{W} = \hat{w}_1 \hat{w}_2 \cdots \hat{w}_m$ that maximizes the probability of the word sequence under the bigram model:
$$\hat{W} = \arg\max_{W} P(W \mid C) \approx \arg\max_{W} \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$</Paragraph>
<Paragraph position="3"> The second step is a tagging task over the sequence of known words acquired in the first step, which aims to detect unknown words, i.e. out-of-vocabulary (OOV) words, in the input. In this process, each known word yielded in the first step is further assigned a tag that indicates whether it is an independently segmented word by itself or a beginning/middle/ending component of an OOV word (Fu and Luke, 2004). To improve our system, part-of-speech information is also introduced in some tracks, such as the PKU open test and the AS open test. A lexicalized HMM tagger is developed to perform this task (Fu and Luke, 2004).</Paragraph>
<Paragraph position="4"> Given a sequence of known words $W = w_1 w_2 \cdots w_n$, the lexicalized HMM tagger attempts to find an appropriate sequence of tags $T = t_1 t_2 \cdots t_n$ that maximizes the conditional probability, i.e. $\hat{T} = \arg\max_{T} P(T \mid W)$.</Paragraph>
</Section>
<Section position="4" start_page="165" end_page="166" type="metho">
<SectionTitle> 3 Settings for Different Tracks </SectionTitle>
<Paragraph position="0"> In the Academia Sinica (AS) open test and the Peking University (PKU) open test, our system is trained using the Sinica Corpus (3.0) and the PFR Corpus, respectively. In all other tests, including all closed tests, the City University of Hong Kong (CityU) open test and the Microsoft Research (MSR) open test, we trained our system using the relevant training corpora provided for the bakeoff.</Paragraph>
<Paragraph position="1"> In the closed tests, the system dictionaries are derived automatically from the relevant training corpora for this bakeoff using the following three criteria: (1) Each character in the training corpus is taken as an independent entry and collected into the relevant system dictionary. (2) A standard Chinese word in the training corpus is entered into the relevant dictionary if it contains four or fewer Chinese characters and its count of occurrences in the corpus is larger than a threshold; in our current system, the threshold is set to 10 for the AS closed test and 5 for the other closed tests. (3) Non-standard Chinese words such as numeral expressions, English words and punctuation marks are not included in the system dictionary if they consist of multiple characters.</Paragraph>
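As an illustration only (not code from the paper), the following Python sketch compiles a system dictionary from a word-segmented training corpus according to the three criteria above. The corpus representation, the crude CJK-based test for "standard" words, and the helper name `build_dictionary` are assumptions made for this sketch.

```python
from collections import Counter


def build_dictionary(segmented_sentences, freq_threshold=5, max_len=4):
    """Compile a system dictionary from a word-segmented training corpus.

    segmented_sentences: iterable of word lists, one list per sentence.
    freq_threshold: minimum corpus frequency for a multi-character entry
        (5 here; the paper uses 10 for the AS closed test).
    max_len: maximum length in characters for a standard-word entry.
    """
    dictionary = set()
    counts = Counter()

    for words in segmented_sentences:
        for w in words:
            counts[w] += 1
            # Criterion (1): every individual character becomes an entry.
            dictionary.update(w)

    # Crude test for a "standard" Chinese word: CJK ideographs only.
    # This is an assumption; the paper does not spell out the test.
    def is_standard(word):
        return all('\u4e00' <= ch <= '\u9fff' for ch in word)

    for w, c in counts.items():
        if len(w) == 1:
            continue  # already covered by criterion (1)
        # Criterion (3): multi-character non-standard strings (numbers,
        # English words, punctuation sequences) are excluded.
        if not is_standard(w):
            continue
        # Criterion (2): standard words of at most max_len characters whose
        # frequency exceeds the threshold are entered.
        if len(w) <= max_len and c > freq_threshold:
            dictionary.add(w)

    return dictionary
```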
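Such a dictionary is what the known-word segmentation step of Section 2 searches over. Below is a minimal Viterbi-style sketch of that step under the word bigram model; the log-space search, the `bigram_prob` interface and the single-character fallback are simplifying assumptions, not the authors' exact formulation.

```python
import math


def segment_known_words(sentence, dictionary, bigram_prob, max_word_len=4):
    """Find the dictionary segmentation W = w1 ... wm that maximizes
    prod_i P(w_i | w_{i-1}) by dynamic programming over word candidates.

    bigram_prob(prev, word) must return a smoothed, non-zero probability;
    how it is estimated from the training corpus is left open here.
    """
    n = len(sentence)
    # best[pos][word] = (log-prob of best path ending with `word` at `pos`,
    #                    start position of `word`, previous word on that path)
    best = [dict() for _ in range(n + 1)]
    best[0]['<s>'] = (0.0, None, None)

    for i in range(n):
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            w = sentence[i:j]
            if w not in dictionary:
                continue
            for prev, (lp, _, _) in best[i].items():
                score = lp + math.log(bigram_prob(prev, w))
                if w not in best[j] or score > best[j][w][0]:
                    best[j][w] = (score, i, prev)

    if not best[n]:
        return list(sentence)  # fallback: emit single characters

    # Trace back the best-scoring path from the end of the sentence.
    w = max(best[n], key=lambda k: best[n][k][0])
    pos, words = n, []
    while w != '<s>':
        words.append(w)
        _, start, prev = best[pos][w]
        pos, w = start, prev
    return list(reversed(words))
```

In practice the bigram probabilities would be estimated from the segmented training corpus, with smoothing for unseen word pairs.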
<Paragraph position="2"> As for the open tests, some other dictionaries are applied. As can be seen from Table 2, the CKIP Lexicon and Chinese Grammar is used in both the AS and CityU open tests, and the Grammatical Knowledge-base of Contemporary Chinese developed by Peking University is utilized in both the PKU and MSR open tests.</Paragraph>
<Paragraph position="3"> It should be noted that part-of-speech information is also utilized in the AS open test and the PKU open test, because part-of-speech information has proved to be informative in identifying OOV words in Chinese text (Fu and Luke, 2004). Therefore, the training corpora for these two tests are tagged with part-of-speech, and entries in the relevant dictionaries are defined with their potential part-of-speech categories.</Paragraph>
</Section>
<Section position="5" start_page="166" end_page="166" type="metho">
<SectionTitle> 4 The Scored Results </SectionTitle>
<Paragraph position="0"> In Bakeoff 2005, six measures are employed to score the performance of a word segmentation system, namely recall (R), precision (P), the evenly-weighted F-measure (F), the out-of-vocabulary (OOV) rate for the test corpus, and recall with respect to OOV words (R_OOV) and in-vocabulary words (R_IV).</Paragraph>
<Paragraph position="1"> In order to achieve a consistent evaluation of our system in both the closed and the open tests, OOV is defined in this paper as the set of words that occur in the test corpus but in neither the training corpus nor the system dictionary. Furthermore, two additional rates, OOV-C and OOV-D, are used to denote the out-of-vocabulary rate with respect to the training corpus and the out-of-vocabulary rate with respect to the system dictionary, respectively. At the same time, the precisions with regard to in-vocabulary words (P_IV) and OOV words (P_OOV) are also computed in this paper to give a more complete evaluation of our system in unknown word identification.</Paragraph>
<Paragraph position="2"> The OOV rates and the scores of our system are summarized in Table 3 and Table 4, respectively. The results show that our system achieves an F-measure of 0.940-0.967 on the different test corpora, while the relevant OOV rates range from 0.023 to 0.074.</Paragraph>
<Paragraph position="3"> Although our system has achieved a promising performance, there is still much to be done to improve it. Firstly, since our system is purely statistics-based, it cannot yield correct segmentations for all non-standard words (NSWs) such as numeral expressions and English strings in Chinese text. Secondly, known word segmentation and unknown word identification are taken as two independent stages in our system. This strategy is simple and easy to apply (Fu and Luke, 2003), but it cannot fully capture the interaction between segmentation ambiguity and unknown words. Although the known word bigram model can partly resolve this problem, it is not always effective for some complicated strings that contain a mixture of ambiguities and unknown words, such as "31g7097 g3824" and the fragment "g1025g15904g19283g14895" in the sentence "g1025g15904g19283g14895g6915g15904g8892g18337g1593g17535".</Paragraph>
</Section>
</Paper>
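As a supplementary illustration of the scoring measures in Section 4, the sketch below computes R, P, F, R_OOV and R_IV for a single segmented sentence by comparing word-boundary spans. The span-based alignment and the `vocabulary` argument (the union of training-corpus words and dictionary entries, matching the paper's OOV definition) are assumptions of this sketch, not the official bakeoff scorer.

```python
def word_spans(words):
    """Map each word to its (start, end) character span, in order."""
    spans, pos = [], 0
    for w in words:
        spans.append(((pos, pos + len(w)), w))
        pos += len(w)
    return spans


def score_sentence(gold_words, sys_words, vocabulary):
    """Compute R, P, F, R_OOV and R_IV as defined in Section 4.

    vocabulary: union of training-corpus words and system-dictionary entries,
    so that "OOV" follows the paper's definition.
    """
    gold = word_spans(gold_words)
    gold_spans = {span for span, _ in gold}
    sys_spans = {span for span, _ in word_spans(sys_words)}

    correct = gold_spans & sys_spans
    recall = len(correct) / len(gold_spans)
    precision = len(correct) / len(sys_spans)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0

    oov = [span for span, w in gold if w not in vocabulary]
    iv = [span for span, w in gold if w in vocabulary]
    r_oov = sum(s in sys_spans for s in oov) / len(oov) if oov else 0.0
    r_iv = sum(s in sys_spans for s in iv) / len(iv) if iv else 0.0

    return {"R": recall, "P": precision, "F": f, "R_OOV": r_oov, "R_IV": r_iv}
```

Corpus-level figures such as those in Tables 3 and 4 would simply accumulate the same counts over all sentences before taking the ratios.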