File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/w06-0131_evalu.xml
Size: 4,575 bytes
Last Modified: 2025-10-06 13:59:51
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0131"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. POC-NLW Template for Chinese Word Segmentation</Title> <Section position="6" start_page="178" end_page="178" type="evalu"> <SectionTitle> 4 Results and Analysis </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="178" end_page="178" type="sub_section"> <SectionTitle> 4.1 System </SectionTitle> <Paragraph position="0"> The system submitted to this bakeoff was a two-stage one, as described at the beginning of this paper.</Paragraph> <Paragraph position="1"> The model used in the first stage was a trigram model, and the L max of the template used in the second stage was set to 7.</Paragraph> <Paragraph position="2"> In addition to the tags defined in the template earlier, a special tag is introduced into our Wl-Pn tag set to indicate all those characters that occur</Paragraph> </Section> <Section position="2" start_page="178" end_page="178" type="sub_section"> <SectionTitle> 4.2 Results at SIGHAN Bakeoff 2006 </SectionTitle> <Paragraph position="0"> Our system participated in the MSRA_Close and UPUC_Close tracks of the SIGHAN Bakeoff 2006. The test results are shown in Table 1. On both corpora, our system ranks in the top half of the participating systems.</Paragraph> <Paragraph position="1"> We notice that accuracy on known-word segmentation is relatively better than on OOV word segmentation. This is somewhat unexpected: in the closed experiments we had run on the PKU and MSR corpora of SIGHAN Bakeoff 2005, the relative performance on OOV Recall was much stronger than on the F-measure. We think this is due to inappropriate parameters in the n-gram model, which over-guarantee the performance of basic word segmentation. This can be seen in the IV Recall (the highest in the UPUC_Close track). 
Because only the best output sequence of the n-gram model is passed to the HMM tagger, some potential unknown words may be wrongly split in the early stage. Thus, the OOV Recall is not very good, and this also affects the overall performance.</Paragraph> <Paragraph position="2"> On the other hand, OOV identification performance on UPUC is much better than on MSRA, while overall segmentation accuracy on UPUC is worse than on MSRA. The same pattern appeared in our experiments on the Bakeoff 2005 PKU and MSR corpora. In the PKU test data, the rate of OOV words is 0.058, while in MSR it is 0.026. Thus, it can be concluded that the more unknown words occur, the more significant the OOV word identification ability becomes.</Paragraph> <Paragraph position="3"> In addition, the relative performance on OOV Precision is much better. This demonstrates that the OOV identification ability of our system is appreciable. In other words, the POC-NLW tagging method introduced is effective to some extent.</Paragraph> </Section> </Section> <Section position="7" start_page="178" end_page="179" type="evalu"> <SectionTitle> 5 CONCLUSION AND FURTHER WORK </SectionTitle> <Paragraph position="0"> In this paper, a POC-NLW template is presented for word segmentation, which aims at exploring the word creation mechanisms of the Chinese language by utilizing character-level information. A two-stage strategy was applied in our system to combine n-gram-model-based word segmentation with OOV word identification implemented by an HMM tagger. Test results show that the method achieved high performance on word segmentation, especially on unknown word identification. Therefore, the method is a practical one that can be implemented as an integral component in actual Chinese NLP applications. 
From the results, it can safely be concluded that the method introduced here does capture some character-level information, and that this information can effectively guide word segmentation and unknown word identification. Because this is the first time we have participated in this bakeoff, and the work was done as an integral part of another system during the past two months, the implementation of the segmentation system we submitted is coarse. Many improvements, in both theoretical methods and implementation techniques, are required in our future work, including the smoothing techniques in the n-gram model and the HMM model, refinement of the feature extraction method and of the POC-NLW template itself, a more harmonious integration strategy, and so on.</Paragraph> </Section> </Paper>