<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0120">
  <Title>On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching</Title>
  <Section position="4" start_page="0" end_page="136" type="metho">
    <SectionTitle>
2 Chinese Word Segmentation System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="134" type="sub_section">
      <SectionTitle>
2.1 Conditional Random Fields
</SectionTitle>
      <Paragraph position="0"> Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2001). A lin-</Paragraph>
      <Paragraph position="2"> is the normalization factor that makes the probability of all state sequences sum to one;</Paragraph>
      <Paragraph position="4"> is its weight. The feature  functions can measure any aspect of a state</Paragraph>
      <Paragraph position="6"> , and the entire observation sequence, x, centered at the current time step, t. For example, one feature function might have the value 1 when y</Paragraph>
      <Paragraph position="8"> x is the character &amp;quot;Guo &amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="134" end_page="135" type="sub_section">
      <SectionTitle>
2.2 Character Clustering
</SectionTitle>
      <Paragraph position="0"> In many cases, Chinese sentences may be interspersed with non-Chinese words. In a closed task, there is no way of knowing how many languages there are in a given text. Our solution is to apply a clustering algorithm to find homogeneous characters belonging to the same character clusters. One general rule we adopted is that a language's characters tend to appear together in tokens. In addition, character clusters exhibit certain distinct properties. The first property is that the order of characters in some pairs can be interchanged. This is referred to as exchangeability. The second property is that some characters, such as lowercase characters, can appear in any position of a word; while others, such as uppercase characters, cannot. This is referred to as location independence. According to the general rule, we can calculate the pairing frequency of characters in tokens by checking all tokens in the corpus. Assuming the alphabet is S , we first need to represent each character as a |S |dimensional vector. For each character c</Paragraph>
      <Paragraph position="2"> to represent its j-dimension value, which is calculated as follows:</Paragraph>
      <Paragraph position="4"> appear in the same word when c  within the range 0 to 1; a is used to enlarge the gap between non-zero and zero frequencies, and g is used to weaken the influence of very high frequencies.</Paragraph>
      <Paragraph position="5"> Next, we apply the K-means algorithm to generate candidate cluster sets composed of K clusters (Hartigan et al., 1979). Different K's, a 's, and g 's are used to generate possible character cluster sets. Our K-means algorithm uses the cosine distance.</Paragraph>
      <Paragraph position="6"> After obtaining the K clusters, we need to select the N  best character clusters among them.</Paragraph>
      <Paragraph position="7"> Assuming the angle between the cluster centroid vector and (1, 1, ... , 1) is th , the cluster with the largest cosine th will be removed. This is because characters whose co-occurrence frequencies are nearly all zero will be transformed into vectors very close to (a , a , ... , a ); thus, their centroids will also be very close to (a , a , ... , a ), leading to unreasonable clustering results.</Paragraph>
      <Paragraph position="8"> After removing these two types of clusters, for each character c in a cluster M, we calculate the inverse relative distance (IRDist) of c using</Paragraph>
      <Paragraph position="10"> stands for the centroid of cluster M</Paragraph>
      <Paragraph position="12"> and m stands for the centroid of M.</Paragraph>
      <Paragraph position="13"> We then calculate the average inverse distance for each cluster M. The N  best clusters are selected from the original K clusters. The above K-means clustering and character cluster selection steps are executed iteratively for each cluster set generated from K-means clustering with different K's, a 's, and g 's. After selecting the N  best clusters for each cluster set, we pool and rank them according to their inner ratios. Each cluster's inner ratio is calculated by the following formula:</Paragraph>
      <Paragraph position="15"> ) denotes the frequency with which characters c</Paragraph>
      <Paragraph position="17"> in the same word.</Paragraph>
      <Paragraph position="18"> To ensure that we select a balanced mix of clusters, for each character in an incoming cluster M, we use Algorithm 1 to check if the frequency of each character in C[?]M is greater than a threshold t .</Paragraph>
      <Paragraph position="19"> Algorithm 1 Balanced Cluster Selection Input: A set of character clusters P={M  4: pick the cluster M that has highest inner ratio; 5: for each character c in M do 6: if the frequency of c in C[?] M is over threshold t 7: P- P- M; 8: continue; 9 : else  10: C- C[?]M; 11: P- P- M; 12: end; 13: end; 14: end  The above algorithm yields the best N   clusters in terms of exchangeability. Next, we execute the above procedures again to select the best N  clusters based on their location independence and exchangeability. However, for</Paragraph>
      <Paragraph position="21"> to denote the value of its j-th dimension. We calculate v</Paragraph>
      <Paragraph position="23"> appear in the same word when c</Paragraph>
      <Paragraph position="25"> stands for the frequency with which c</Paragraph>
      <Paragraph position="27"> but not in the first position.</Paragraph>
      <Paragraph position="28"> We choose the minimum value from</Paragraph>
      <Paragraph position="30"> both appear in the first position of a word and their order is exchangeable, the four frequency values, including the minimum value, will all be large enough.  Our next goal is to create the best hybrid of the above two cluster sets. The set selected for exchangeability is referred to as the EX set, while the set selected for both exchangeability and location independence is referred to as the EL set. We create a development set and use the best first strategy to build the optimal cluster set from EX[?]EL. The EX and EL for the CTU corpus are shown in Table 1.</Paragraph>
    </Section>
    <Section position="3" start_page="135" end_page="135" type="sub_section">
      <SectionTitle>
2.3 Handling Non-Chinese Words
</SectionTitle>
      <Paragraph position="0"> Non-Chinese characters suffer from a serious data sparseness problem, since their frequencies are much lower than those of Chinese characters.</Paragraph>
      <Paragraph position="1"> In bigrams containing at least one non-Chinese character (referred as non-Chinese bigrams), the problem is more serious. Take the phrase &amp;quot;Yue Mo 20Sui &amp;quot; (about 20 years old) for example. &amp;quot;2&amp;quot; is usually predicted as I, (i.e., &amp;quot;Yue Mo &amp;quot; is connected with &amp;quot;2&amp;quot;) resulting in incorrect segmentation, because the frequency of &amp;quot;2&amp;quot; in the I class is much higher than that of &amp;quot;2&amp;quot; in the B class, even though the feature C  =&amp;quot;Yue Mo &amp;quot; has a high weight for assigning &amp;quot;2&amp;quot; to the B class. Traditional approaches to CWS only use one general tagger (referred as the G tagger) for segmentation. In our system, we use two CWS taggers. One is a general tagger, similar to the traditional approaches; the other is a specialized tagger designed to deal with non-Chinese words.</Paragraph>
      <Paragraph position="2"> We refer to the composite tagger (the general tagger plus the specialized tagger) as the GS tagger.</Paragraph>
      <Paragraph position="3"> Here, we refer to all characters in the selected clusters as non-Chinese characters. In the development stage, the best-first feature selector determines which clusters will be used. Then, we convert each sentence in the training data and test data into a normalized sentence. Each non-Chinese character c is replaced by a cluster representative symbol s M , where c is in the cluster M. We refer to the string composed of all s M as F. If the length of F is more than that of W, it will be shortened to W. The normalized sentence is then placed in one file, and the non-Chinese character sequence is placed in another. Next, we use the normalized training and test file for the general tagger, and the non-Chinese sequence training and test file for the specialized tagger. Finally, the results of these two taggers are combined.</Paragraph>
      <Paragraph position="4"> The advantage of this approach is that it resolves the data sparseness problem in non-Chinese bigrams. Consider the previous example in which s stands for the numeral cluster. Since there is a phrase &amp;quot;Yue Mo 8Nian &amp;quot; in the training data,  = &amp;quot;Mo 8&amp;quot; is still an unknown bigram using the G tagger. By using the GS tagger, however, &amp;quot;Yue Mo 20Sui &amp;quot; and &amp;quot;Yue Mo 8Nian &amp;quot; will be converted as &amp;quot;Yue Mo ssSui &amp;quot; and &amp;quot;Yue Mo sNian &amp;quot;, respectively. Therefore, the bigram feature C  longer unknown. Also, since s in &amp;quot;Mo s &amp;quot; is tagged as B, (i.e., &amp;quot;Mo &amp;quot; and &amp;quot;s &amp;quot; are separated), &amp;quot;Mo &amp;quot; and &amp;quot;s &amp;quot; will be separated in &amp;quot;Yue Mo ssSui &amp;quot;.</Paragraph>
    </Section>
    <Section position="4" start_page="135" end_page="136" type="sub_section">
      <SectionTitle>
2.4 Generating and Applying Templates
Template Generation
</SectionTitle>
      <Paragraph position="0"> We first extract all possible word candidates from the training set. Given a minimum word length L, we extract all words whose length is greater than or equal to L, after which we align all word pairs. For each pair, if more than fifty  percent of the characters are identical, a template will be generated to match both words in the pair. Template Filtering We have two criteria for filtering the extracted templates. First, we test the matching accuracy of each template t on the development set. This is calculated by the following formula: strings matched all of # separators no with strings matched of # )( =tA .</Paragraph>
      <Paragraph position="1"> In our system, templates whose accuracy is lower than the threshold t  are discarded. For the remaining templates, we apply two different strategies. According to our observations of the development set, most templates whose accuracy is less than t  are ineffective. To refine such templates, we employ the character class information generated by character clustering to impose a class limitation on certain template slots. This regulates the potential input and improves the precision. Consider a template with one or more wildcard slots. If any string matched with these wildcard slots contains characters in different clusters, this template is also discarded. Template-Based Post-Processing (TBPP) After the generated templates have been filtered, they are used to match our CWS output and check if the matched tokens can be combined into complete words. If a template's accuracy is greater than t  , then all separators within the matched strings will be eliminated; otherwise, for a template t with accuracy between t  we eliminate all separators in its matched string if no substring matched with t's wildcard slots contains characters in different clusters. Resultant words of less than three characters in length are discarded because CRF performs well with such words.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="136" end_page="136" type="metho">
    <SectionTitle>
3 Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="136" end_page="136" type="sub_section">
      <SectionTitle>
3.1 Dataset
</SectionTitle>
      <Paragraph position="0"> We use the three larger corpora in SIGHAN Bakeoff 2006: a Simplified Chinese corpus provided by Microsoft Research Beijing, and two Traditional Chinese corpora provided by Academia Sinica in Taiwan and the City University of Hong Kong respectively. Details of each corpus are listed in Table 2.</Paragraph>
      <Paragraph position="1">  enhanced GST tagger. We observe that the GST tagger outperforms the G tagger on all three corpora. null  ger and the GST Tagger</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>