<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3029">
  <Title>Maximal Match Chinese Segmentation Augmented by Resources Generated from a Very Large Dictionary for Post-Processing</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes our participation in two open tracks of the ACL SIGHAN-sponsored Second International Chinese Word Segmentation Bakeoff, namely Academia Sinica open (ASo) and Peking University open (PKo). The production segmentation system we used draws heavily on a large dictionary derived from processing a very large amount of synchronous textual data. Section 2 describes our segmentation flow for the current Bakeoff, Section 3 evaluates and analyses the results, Section 4 analyses errors and discusses their implications, and Section 5 concludes.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="176" type="metho">
    <SectionTitle>
2 Segmentation Framework
</SectionTitle>
    <Paragraph position="0"> The major resource of our segmentation system is a large dictionary. In the following, we describe the main segmentation mechanism based on maximal matching, and other supplementary features for post-processing attempted in the current Bakeoff.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Dictionary-based Segmentation
</SectionTitle>
      <Paragraph position="0"> The primary mechanism of segmentation makes use of a large dictionary derived from processing a large amount (over 150 million Chinese characters) of synchronous textual data, mostly printed news, gathered from various Chinese speech communities, including Beijing, Hong Kong, Taipei, and others, following a uniform segmentation standard. The dictionary has now grown to a size of over 800,000 word types, with frequencies of each entry being tracked closely. For this Bakeoff, additional items from the respective training data were also included in the existing dictionary for segmentation. Thus unsegmented texts will first go through a process of Backward Maximal Matching (BMM) segmentation equipped with the combined dictionary. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="176" type="sub_section">
      <SectionTitle>
2.2 Supplementary Features
</SectionTitle>
      <Paragraph position="0"> According to specific divergence of the segmentation standard of each test corpus from our production standard, a set of general adaptation rules were applied to transform the texts to achieve &amp;quot;standard complacency&amp;quot; as much as possible. The adaptation rules vary in nature,  depending on how intense the segmentation standard differences are between each test corpus and our own. Hence some rules are based on linguistic structures while others are based on particular treatment of elements like numerals and units.</Paragraph>
      <Paragraph position="1"> These adaptation rules are coupled with a set of heuristic segmentation disambiguation rules derived from our long-term and extensive processing of text data. Such rules are based on BMM, and amount to around 20,000 at the time of writing. Each rule has gone through careful consideration before putting to real production use, to ensure that they produce correct results in most cases without overgeneralisation.</Paragraph>
      <Paragraph position="2">  and Replacement After texts were segmented by BMM, the forward counterpart (Forward Maximal Matching, FMM) was also done for comparison, as the discrepancies between the two segmented texts often indicate potential ambiguities. Statistical information such as the frequency distributions of the segmented units in question were obtained from our large dictionary. By comparing the independent joint likelihood of the two combinations, segmented units with exceptionally low frequency are likely to be disregarded, allowing us to choose the correct segmentation. For example, in the test data, the phrase 4?.3 is segmented as4?/.3 by the backward approach, whereas4?.3/ will be obtained if segmented forwardly. The latter segmented alternative,4?.3/, is more likely to appear in the text.</Paragraph>
      <Paragraph position="3">  One of the most challenging issues in Chinese word segmentation is the treatment of unknown words which can be further divided into two categories: new words (NWs) and named entities (NEs). In our treatment of unknown words, a slight distinction was made between Chinese NEs and other NWs including foreign names.</Paragraph>
      <Paragraph position="4"> The detection processes are similar but statistical data were gathered from different portions of our textual data. When a sequence of single characters is hit, windows of two and three characters (only nominal morphemes were considered) were extracted to form &amp;quot;potential NE/NW candidates&amp;quot;. The likelihood of these characters being monosyllabic words (i.e. out-word) and that of being part of multi-syllabic words (i.e. in-word) were compared to make the best guess whether they should be combined or segmented.</Paragraph>
      <Paragraph position="5"> For NE detection, the in-word statistics was based on all the multi-syllabic named entities in the Taipei portion from our dictionary and the out-word statistics on the rest of it. The in-word frequency of a given character is thus the number of times the character appears within a multi-syllabic named entity. The in-word probability is the in-word frequency divided by the total number of times the character appears in all our textual data. The independent joint in-word and out-word probabilities were computed and compared for each candidate, which would be combined as a word if the in-word probability is greater than the out-word probability and the first character in the candidate is within a list of Chinese surnames, again collected from all textual data.</Paragraph>
      <Paragraph position="6"> For NW detection, the in-word statistics was based on all the multi-syllabic words in our dictionary. For every newly combined word, neighbouring prefixes and suffixes (according to those provided in the segmentation standard) were also detected and combined, if any. A list of foreign names and all the characters appearing in them was also extracted from our dictionary. When a new word is detected, its neighbouring words would be scanned and would be combined if they are within this foreign name list, thus enabling the identification of names like)| _!.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>