<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0138">
  <Title>Using Part-of-Speech Reranking to Improve Chinese Word Segmentation</Title>
  <Section position="5" start_page="205" end_page="206" type="metho">
    <SectionTitle>
3 Features
3.1 Features for Segmentation
</SectionTitle>
    <Paragraph position="0"> We adopted the basic segmentation features used in (Ng and Low, 2004). These features are summarized in Table 1 ((1.1)-(1.7)). In these templates, C0 refers to the current character, and C-n and Cn refer to the characters n positions to the left and right of the current character, respectively. Pu(C0) indicates whether C0 is a punctuation character. T(Cn) classifies the character Cn into four classes: numbers, dates (year, month, day), English letters, and all other characters. LBegin(C0), LEnd(C0) and LMid(C0) represent the maximum length of words found in a lexicon that contain the current character as the first, last or middle character, respectively. Single(C0) indicates whether the current character can be found as a single word in the lexicon.</Paragraph>
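As an illustration, the templates above can be sketched as a single feature-extraction function. This is a minimal approximation rather than the authors' implementation: the punctuation set, the date-marker check in char_type, and the brute-force lexicon scan are all simplifying assumptions, and the LMid template is omitted for brevity.

```python
def char_type(c):
    """Rough version of the T feature: number, date marker, letter, or other."""
    if c.isdigit():
        return "NUM"
    if c in "年月日":  # year/month/day markers, a simplification of the date class
        return "DATE"
    if c.isascii() and c.isalpha():
        return "LETTER"
    return "OTHER"

def seg_features(chars, i, lexicon):
    """Basic segmentation features for the character at position i."""
    def C(n):
        j = i + n
        return chars[j] if j in range(len(chars)) else "NONE"
    feats = ["C%d=%s" % (n, C(n)) for n in (-2, -1, 0, 1, 2)]        # unigrams
    feats += ["C%dC%d=%s%s" % (n, n + 1, C(n), C(n + 1))
              for n in (-2, -1, 0, 1)]                               # bigrams
    feats.append("Pu=%s" % (C(0) in "，。！？、"))                    # punctuation flag
    feats.append("T=" + "".join(char_type(C(n)) for n in (-1, 0, 1)))
    # Lexicon features: longest lexicon word with C0 as first / last character,
    # and whether C0 is itself a single-character word.
    lbegin = max([len(w) for w in lexicon if w.startswith(C(0))] or [0])
    lend = max([len(w) for w in lexicon if w.endswith(C(0))] or [0])
    feats += ["LBegin=%d" % lbegin, "LEnd=%d" % lend,
              "Single=%s" % (C(0) in lexicon)]
    return feats
```

In practice each feature string would be fed to the CRF as a binary indicator; the toy lexicon scan would be replaced by a trie lookup for realistic lexicon sizes.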
    <Paragraph position="1"> Besides the basic features adopted above, we also experimented with additional semantic features (Table 1 (1.8)). For (1.8), Sem0 refers to the semantic class of the current character, and Sem-1 and Sem1 represent the semantic classes of the characters one position to the left and right of the current character, respectively. We obtained a character's semantic class from HowNet (Dong and Dong, 2006). Since many characters have multiple semantic classes defined by HowNet, choosing among them is a non-trivial task. We performed contextual disambiguation of characters' semantic classes by calculating semantic class similarities. For example, let us assume the current character is _d_4663 (look) and the following character is _d_2958 (newspaper). The character _d_4663 (look) has two semantic classes in HowNet, i.e. _d_6321 (read) and _d_1499_d_3826 (doctor). To determine which class is more appropriate, we check the example words given by HowNet to illustrate the meanings of the two semantic classes. For _d_6321 (read), the example word is _d_4663_d_1015 (read book); for _d_1499_d_3826 (doctor), the example word is _d_4663_d_4542 (see a doctor). We then calculated the semantic class similarity scores between _d_2958 (newspaper) and _d_1015 (book), and between _d_2958 (newspaper) and _d_4542 (illness), using HowNet's built-in similarity measure. Since _d_2958 (newspaper) and _d_1015 (book) both have the semantic class _d_3230_d_1015 (document), their maximum similarity score is 0.95, whereas the maximum similarity score between _d_2958 (newspaper) and _d_4542 (illness) is 0.03478. Therefore, Sem0Sem1 = _d_6321 (read), _d_3230_d_1015 (document). Similarly, we can derive Sem-1Sem0. For Sem0, we simply picked the top four semantic classes ranked by HowNet, and used &amp;quot;NONE&amp;quot; for absent values.</Paragraph>
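The disambiguation procedure above can be sketched with small stand-in dictionaries in place of HowNet; sem_classes, example_word, and the similarity table are all assumptions mirroring the worked example, not real HowNet data or its API.

```python
# Stand-ins for HowNet lookups, mirroring the look/newspaper example above.
sem_classes = {"look": ["read", "doctor"]}            # candidate classes for C0
example_word = {"read": "book", "doctor": "illness"}  # HowNet example objects
similarity = {("newspaper", "book"): 0.95,            # toy similarity table in
              ("newspaper", "illness"): 0.03478}      # place of HowNet's measure

def disambiguate(char, neighbor):
    """Pick the semantic class of `char` whose HowNet example word is most
    similar to the neighboring character, as in the example above."""
    return max(sem_classes[char],
               key=lambda cls: similarity.get((neighbor, example_word[cls]), 0.0))
```

For the paper's example, the "read" class wins because its example word "book" shares the document class with "newspaper" (similarity 0.95 vs. 0.03478).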
    <Section position="1" start_page="206" end_page="206" type="sub_section">
      <SectionTitle>
3.2 Features for POS Tagging
</SectionTitle>
      <Paragraph position="0"> The bottom half of Table 1 summarizes the feature templates we employed for POS tagging. W0 denotes the current word. W-n and Wn refer to the words n positions to the left and right of the current word, respectively. Cn(W0) is the nth character in the current word. If the number of characters in the word is fewer than 5, we use &amp;quot;NONE&amp;quot; for absent characters. Len(W0) is the number of characters in the current word. We also used a group of binary features for each word to represent the morphological properties of the current word, e.g. whether the current word is a punctuation mark, a number, a foreign name, etc.</Paragraph>
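A minimal sketch of these word-level templates follows. The morphological flags shown (IsNum, IsPunc) are illustrative approximations; the paper's full set also covers foreign names and other properties not reproduced here.

```python
def pos_features(words, i):
    """Word-level POS templates: surrounding words, the characters of W0
    (padded to 5 with NONE), word length, and simple binary flags."""
    def W(n):
        j = i + n
        return words[j] if j in range(len(words)) else "NONE"
    w0 = W(0)
    feats = ["W%d=%s" % (n, W(n)) for n in (-2, -1, 0, 1, 2)]
    for k in range(5):  # first five characters of W0, NONE-padded
        feats.append("C%d=%s" % (k, w0[k] if k in range(len(w0)) else "NONE"))
    feats.append("Len=%d" % len(w0))
    feats.append("IsNum=%s" % w0.isdigit())     # illustrative binary flags
    feats.append("IsPunc=%s" % (w0 in "，。！？、"))
    return feats
```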
    </Section>
  </Section>
  <Section position="6" start_page="206" end_page="207" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We evaluated our system's segmentation results on the SIGHAN Bakeoff 2006 dataset. To evaluate our reranking method's impact on the POS tagging part, we also performed 10-fold cross-validation tests on the 250k Penn Chinese Treebank (CTB 250k). The CRF model for POS tagging is trained on CTB 250k in all the experiments. We report recall (R), precision (P), and F1-score (F) for both the word segmentation and POS tagging tasks. The N value is chosen to be 20 for the N-best list reranking, based on cross-validation. For CRF learning and decoding, we use the CRF++ toolkit.</Paragraph>
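The N-best reranking scheme evaluated here can be sketched as follows; seg_nbest and pos_score are hypothetical stand-ins for the two CRF models (the segmenter's N-best output and the tagger's score for a candidate segmentation), not CRF++ calls.

```python
# Minimal sketch of N-best list reranking under the assumptions above.
def rerank(sentence, seg_nbest, pos_score, n=20):
    """Keep the candidate with the best combined segmenter + tagger score.

    seg_nbest(sentence) is assumed to return a score-sorted list of
    (segmentation, seg_log_prob) pairs; pos_score(segmentation) is the
    POS model's log-probability for its best tag sequence.
    """
    candidates = seg_nbest(sentence)[:n]
    return max(candidates, key=lambda c: c[1] + pos_score(c[0]))[0]
```

With N = 20, as chosen by cross-validation, the POS model can overturn the segmenter's top choice when it strongly prefers an alternative segmentation.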
    <Section position="1" start_page="206" end_page="207" type="sub_section">
      <SectionTitle>
4.1 Results on Bakeoff 2006 Dataset
</SectionTitle>
      <Paragraph position="1"> We participated in the open tracks of the SIGHAN Bakeoff 2006, and achieved F-scores of 0.935 (UPUC), 0.964 (CityU), 0.952 (MSRA) and 0.949 (CKIP). More detailed performance statistics, including in-vocabulary recall (Riv) and out-of-vocabulary recall (Roov), are shown in Table 2.</Paragraph>
      <Paragraph position="2"> More interesting to us is how much the N-best list reranking method using POS tagging helped to increase segmentation performance. For comparison, we ran a linear cascade of segmentation and POS tagging CRFs without reranking as the baseline system; the results are shown in Table 3. Our reranking method consistently improved segmentation scores. In particular, recall improved more than precision across all four tracks, and the greatest improvement came from the UPUC track.</Paragraph>
      <Paragraph position="3"> We attribute this to the fact that our POS tagging model is trained on CTB 250k, which may be drawn from the same corpus as the UPUC training data; the segmentation standard of the POS tagging training data therefore matches that of the segmentation training data more closely.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="207" end_page="207" type="metho">
    <SectionTitle>
4.2 Results on CTB Corpus
</SectionTitle>
    <Paragraph position="0"> To evaluate our reranking method's impact on the POS tagging task, we also tested our systems on the CTB 250k corpus using 10-fold cross-validation.</Paragraph>
    <Paragraph position="1"> Figure 1 summarizes the results of the segmentation and POS tagging tasks on the CTB 250k corpus. From Figure 1 we can see that our reranking method improved both segmentation and tagging accuracies across all 10 tests. We conducted pairwise t-tests, and our reranking model was found to be statistically significantly better than the baseline model, with p-values of 5.0 x 10^-4 for segmentation and 3.3 x 10^-5 for POS tagging.</Paragraph>
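The pairwise t-test used above can be sketched as the paired t statistic over per-fold scores; the p-value then follows from the t distribution with n - 1 degrees of freedom (scipy.stats.ttest_rel computes both in one call). The fold scores in the test below are made up, not the paper's numbers.

```python
import math

def paired_t(xs, ys):
    """Paired t statistic for matched samples, e.g. per-fold F1 of the
    reranked system (xs) against the baseline (ys)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)                     # t with n-1 dof
```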
  </Section>
  <Section position="8" start_page="207" end_page="207" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> Our system uses conditional random fields to perform the Chinese word segmentation and POS tagging tasks simultaneously. In particular, we proposed an approximate joint decoding method that reranks the N-best segmenter output based on POS tagging information. Our experimental results on both the SIGHAN Bakeoff 2006 datasets and the Penn Chinese Treebank show that our reranking method consistently increased both segmentation and POS tagging accuracies. It is worth noting that our reranking method can be applied not only to Chinese segmentation and POS tagging, but also to many other pairs of sequential tasks that can benefit from learning transfer, such as POS tagging and NP-chunking.</Paragraph>
  </Section>
</Paper>