<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1722">
  <Title>Chinese Word Segmentation at Peking University</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The test corpus provider
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Corpus
</SectionTitle>
      <Paragraph position="0"> The corpus we provided to the sponsor includes: null A training set from People's Daily (January, 1998) null A test set from People's Daily (Page 4 of January 1, 1998) Data from People's Daily features standard Chinese, little language error, a wide coverage of linguistic phenomenon and topics, which are required for statistic training. Meanwhile, the corpus we provided is a latest version manually validated, hence a high level of correctness and consistency.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Specification
</SectionTitle>
      <Paragraph position="0"> When processing a corpus, we need a detailed and carefully designed specification for guidance. And when using the corpus for NLP evaluation, we also need such a specification to ensure a fair contest for different systems within a common framework.</Paragraph>
      <Paragraph position="1"> We provided the latest version of our specification, which has been published in the Journal of Chinese Information Processing.</Paragraph>
      <Paragraph position="2"> Based on our experience of large-scale corpus processing in recent years, the current version gave us different perspectives in a consistent way, and we hope it will also help others in this field know better of our segmented and POS-tagged corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. The participant
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Training and testing
</SectionTitle>
      <Paragraph position="0"> Our research on word segmentation has been focusing on People's Daily. As we are one of the two providers of Chinese corpora in GB code in this Bakeoff, we had to test on the Penn Chinese treebank.</Paragraph>
      <Paragraph position="1"> Not all the training and test corpus we got came from the Mainland China. Some were GB data converted from BIG5 texts of Taiwan. It is commonly known that in the Mainland, Hong Kong and Taiwai, the Chinese langauge is used diversely not only in the sense of different coding systems, but in respect of different wordings as well.</Paragraph>
      <Paragraph position="2"> While training our segmenter, we studied the guidelines and training corpus of Penn Chinese treebank, tracing the differences and working on them. The main difference between the work of U. Penn and that of ours is notion of &amp;quot;word&amp;quot;. For instance: Differences of &amp;quot;Word&amp;quot; U. Penn PKU  These are different combinations in regard of words which follow certain patterns, and can therefore be handled easily by applying rules to the grogram. The real difficulty for us, however, is the following items:  recourses, so we had to find the lexical correspondence to reduce the negtive effect caused by the difference between Penn Chinese treebank and our own corpus.</Paragraph>
      <Paragraph position="3"> However, as the training corpus is small, we could not remove all the negative effect, and the untackled problems remained to affect our test result.</Paragraph>
      <Paragraph position="4"> Further, as we have been working on language data from the Mainland China, the lexicon of our segmenter does not contain words used in Taiwan. Such being the case, we added into our lexicon the entries that were not known (i.e., not found in the training set) and could not be handled by the rule-based makeshift either. But because we are not very familiar with the Chinese language used in Taiwan, we could not make a complete patch due to the limit of time.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Result analysis
</SectionTitle>
      <Paragraph position="0"> From the test result that the sponsor provided, we can see our segmenter failed to score when the notion of &amp;quot;word&amp;quot; and the recognition of unknown words are involved.</Paragraph>
      <Paragraph position="1">  In addition, there are also cognitive differences concerning the objective world, which did come up to influence our fine score.  The recognition of unknown words has long been a bottleneck for word segmentation technique. So far we have not found a good solution, but we are confident about a progress in this respect in the near future.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Conclusion
</SectionTitle>
    <Paragraph position="0"> Word segmentation is the first step yet a key step in Chinese information processing, but we have not found a perfect solution up till now. From an engineering perspective, we think there is no need for a unique result of segmentation. All roads lead to Rome. The approach you take, technical or non-technical, will be a good one if the expected result is achieved. And it would be more desirable if the processing program in each step can tolerate or even correct the errors made in the previous step.</Paragraph>
    <Paragraph position="1"> We learn from our experience that the computer processing of natural language is a complex issue, which requires a solid fundamental research (on the language itself) to ensure a higher accuracy of automation. It is definitely hard to achieve an increase of one percent or even less in the accuracy of word segmentation, but we are still confident and will keep working in this respect.</Paragraph>
    <Paragraph position="2"> Finally, we would like to thank Dr. Li Baoli and Dr. Bing SWEN for their great efforts on the maintenance of our segmentation program.</Paragraph>
  </Section>
class="xml-element"></Paper>