<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1701">
  <Title>Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation</Title>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3 Ensemble of Naive Bayesian Classifiers for Overlapping Ambiguity Resolution
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Problem Definition
</SectionTitle>
      <Paragraph position="0"> We first give the formal definition of overlapping ambiguous string (OAS) and longest OAS.</Paragraph>
      <Paragraph position="1"> An OAS is a Chinese character string O that satisfies the following two conditions: a) There exist two segmentations Seg  are different from either literal strings or positions;  overlap.</Paragraph>
      <Paragraph position="2"> The first condition ensures that there are ambiguous word boundaries (if more than one word segmentors are applied) in an OAS. In the example presented in section 1, the string &amp;quot;g2520g3281 g7389&amp;quot; is an OAS but &amp;quot;g2520g3281g7389g1237g1006&amp;quot; is not because the word &amp;quot;g1237g1006&amp;quot; remains the same in both FMM and BMM segmentations of &amp;quot;g2520  |g3281g7389  |g1237g1006&amp;quot; and</Paragraph>
      <Paragraph position="4"> that the ambiguous word boundaries result from crossing brackets. As illustrated in Figure 1, words &amp;quot;g2520g3281&amp;quot; and &amp;quot;g3281g7389&amp;quot; form a crossing bracket. The longest OAS is an OAS that is not a sub-string of any other OAS in a given sentence. For example, in the case &amp;quot;g10995g8975g8712g5191&amp;quot; (sheng1-huo2shui3-ping2, living standard), both &amp;quot;g10995g8975g8712&amp;quot; and &amp;quot;g10995g8975g8712g5191&amp;quot; are OASs, but only &amp;quot;g10995g8975g8712g5191&amp;quot; is the longest OAS because &amp;quot;g10995g8975g8712&amp;quot; is a substring of &amp;quot;g10995g8975g8712g5191&amp;quot;. In this paper, we only consider the longest OAS because both left and right boundaries of the longest OAS are determined.</Paragraph>
      <Paragraph position="5"> Furthermore, we constrain our search space within the FMM segmentation O</Paragraph>
      <Paragraph position="7"> of a given longest OAS. According to Huang (1997), two important properties of OAS has been identified: (1) if the FMM segmentation is the same as its BMM segmentation (O</Paragraph>
      <Paragraph position="9"> Search Engine), the probability that the MM segmentation is correct is 99%; Otherwise, (2) if the FMM segmentation differs from its BMM segmen-</Paragraph>
      <Paragraph position="11"> ), for example &amp;quot;g2520g3281g7389&amp;quot;, the probability that at least one of the MM segmentation is correct is also 99%. So such a strategy will not lower the coverage of our approach.</Paragraph>
      <Paragraph position="12"> Therefore, the overlapping ambiguity resolution can be formulized as a binary classification problem as follows: Given a longest OAS O and its context feature set C, let G(Seg, C) be a score function of Seg for },{ bf OOseg [?] , the overlapping ambiguity resolution task is to make the binary decision:  means that both FMM and BMM arrive at the same result. The classification process can then be stated as:</Paragraph>
      <Paragraph position="14"> , then choose either segmentation result since they are same; b) Otherwise, choose the one with the higher score G according to Equation (1).</Paragraph>
      <Paragraph position="15"> For example, in the example of &amp;quot;g6640g13046g5353g6818&amp;quot;, if</Paragraph>
      <Paragraph position="17"> lected as the answer. In another example of &amp;quot;g2520g3281 g7389&amp;quot; in sentence (1) of Figure 1, O</Paragraph>
      <Paragraph position="19"/>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Naive Bayesian Classifier for Overlapping Ambiguity Resolution
</SectionTitle>
      <Paragraph position="0"> Last section formulates the overlapping ambiguity resolution of an OAS O as the binary classification between O</Paragraph>
      <Paragraph position="2"> . This section describes the use of the adapted Naive Bayesian Classifier (NBC) (Duda and Hart, 1973) to address problem. Here, we use the words around O within a window as features, with w</Paragraph>
      <Paragraph position="4"> denoting m words on the left of the O and w</Paragraph>
      <Paragraph position="6"> denoting n words on the right of the O. Naive Bayesian Classifier assumes that all the feature variables are conditionally independent. So the joint probability of observing a set of context features C = {w  in Equation (1) G, we then have two parameters to be estimated: p(Seg) and p(w</Paragraph>
      <Paragraph position="8"> |Seg). Since we do not have enough labeled training data, we then resort to the redundancy property of natural language. Due to the fact that the OAS occupies only in a very small portion of the entire Chinese text, it is feasible to estimate the word co-occurrence probabilities from the portion of corpus that contains no overlapping ambiguities. Consider an OASg1461g5527g3332 (xin4-xin1-de, confidently). The correct segmentation would be &amp;quot;g1461g5527  |g3332&amp;quot;, if g1817g9397 (cong1-man3, full of) were its context word. We note thatg1817g9397 appears as the left context word of g1461g5527 in both strings g1817g9397g1461g5527g3332 and g1817g9397g1461g5527g2656g2203g8680 (g2203g8680, yong3-qi4, courage). While the former string contains an OAS, the latter does not. We then remove all OAS from the training data, and estimate the parameters using the training data that do not contain OAS. In experiments, we replace all longest OAS that has O</Paragraph>
      <Paragraph position="10"> with a special token [GAP].</Paragraph>
      <Paragraph position="11"> Below, we refer to the processed corpus as tokenized corpus.</Paragraph>
      <Paragraph position="12"> Note that Seg is either the FMM or the BMM segmentation of O, and all OASs (including Seg) have been removed from the tokenized corpus, thus there are no statistical information available to  on the Maximum Likelihood Estimation (MLE) principle. To estimate them, we introduce the following two assumptions.</Paragraph>
      <Paragraph position="13"> 1) Since the unigram probability of each word w can be estimated from the training data, for a given segmentation Seg=w</Paragraph>
      <Paragraph position="15"> , we assume that each word w of Seg is generated independently. The probability p(Seg) is approximated by the production of the word unigram  where the word sequence probabilities P(w</Paragraph>
      <Paragraph position="17"> ) are decomposed as productions of trigram probabilities. We used a statistical language model toolkit described in (Gao et al, 2002) to build trigram models based on the tokenized corpus.</Paragraph>
      <Paragraph position="18"> Although the final language model is trained based on a tokenized corpus, the approach can be regarded as an unsupervised one from the view of the entire training process: the tokenized corpus is automatically generated by an MM based segmentation tool from the raw corpus input with neither human interaction nor manually labeled data required. null</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Ensemble of Classifiers and Majority Vote
</SectionTitle>
      <Paragraph position="0"> Given different window sizes, we can obtain different classifiers. We then combine them to achieve better results using the so-called ensemble learning (Peterson 2000). Let NBC(l, r) denote the classifier with left window size l and right window size r. Given the maximal window size of 2, we then have 9 classifiers, as shown in Table 1.</Paragraph>
      <Paragraph position="2"> The ensemble learning suggests that the ensemble classification results are based on the majority vote of these classifiers: The segmentation that is selected by most classifiers is chosen.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>