<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1815">
  <Title>Combining Classifiers for Chinese Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="111" type="metho">
    <SectionTitle>
2 Combining Classifiers for Chinese Word Segmentation
</SectionTitle>
    <Paragraph position="0"> Thetwomachine-learningmodelsweuseinthis work are the maximum entropy model (Ratnaparkhi 1996) and the error-driven transformation-based learning model (Brill 1994).Weusetheformerasthemainworkhorse and the latter to correct some of the errors producedbytheformer.</Paragraph>
    <Paragraph position="1"> 2.1Reformulatingwordsegmentation asataggingproblem Before we apply the machine-learning algorithms we first convert the manually segmented words in the corpus into a tagged sequenceofChinesecharacters.Todothis,we tageachcharacterwithoneofthefourtags,LL, RR, MM and LR, depending on its position withinaword.ItistaggedLLifitoccursonthe leftboundaryofaword,andformsawordwith thecharacter(s)onitsright.ItistaggedRRifit occurs on the right boundary of a word, and formsawordwiththecharacter(s)onitsleft.It istaggedMMifitoccursinthemiddleofaword.</Paragraph>
    <Paragraph position="3"> dollars in per capita GDP by the end of the century.</Paragraph>
    <Paragraph position="4"> Given a manually segmented corpus, a POC-tagged corpus can be derived trivially with perfectaccuracy.Thereasonthatweusesuch POC-taggedsequencesofcharactersinsteadof applying n-gram rules to a segmented corpus directly (Hockenmaier and Brew 1998, Xue  resolution in which the correct POC tag is determinedamongseveralpossiblePOCtagsin a specific context. Our next step is to train a  maximumentropymodelontheperfectlyPOCtaggeddataderivedfromamanuallysegmented null  corpusandusethemodeltoautomaticallyPOCtagunseentext. null</Paragraph>
    <Section position="1" start_page="0" end_page="111" type="sub_section">
      <SectionTitle>
2.2 The maximum entropy tagger
</SectionTitle>
      <Paragraph position="0"> The maximum entropy model used in POS-tagging is described in detail in Ratnaparkhi (1996)andthePOCtaggerhereusesthesame probability model. The probability model is defined over H x T , where H is the set of</Paragraph>
      <Paragraph position="2"> howthecurrentcharactershouldbePOC-tagged.</Paragraph>
      <Paragraph position="3"> For example, a punctuation mark is generally treated as one segment in the CTB corpus.</Paragraph>
      <Paragraph position="4"> Therefore,ifacharacterisapunctuationmark, then it should be POC-tagged LR. This also meansthatthepreviouscharactershouldclosea wordandthefollowingcharactershouldstarta word. When the training is complete, the featuresandtheircorrespondingparameterswill be used to calculate theprobability of the tag sequence of a sentence when the tagger tags unseen data. Given a sequence of characters  tagged corpus most like the reference corpus. The maximum gain is calculated with an evaluation function which quantifies the gain and takes the largest value. The rules are instantiations of a set of pre-defined rule templates.Aftertherulewiththemaximumgain is found, it is applied to the maxent-tagged corpus,whichwillbetterresemblethereference corpusasaresult.Thisprocessisrepeateduntil the maximum gain drops below a pre-defined threshold, which indicates improvement achievedthroughfurthertrainingwillnolonger be significant. The training will then be  a. The preceding (following) character is tagged z.</Paragraph>
      <Paragraph position="5"> b.Thecharactertwobefore(after)istagged z.</Paragraph>
      <Paragraph position="6"> c. One of the two preceding (following) charactersistagged z.</Paragraph>
      <Paragraph position="7"> d. One of the three preceding (following) charactersistagged z.</Paragraph>
      <Paragraph position="8"> e.Theprecedingcharacteristagged zandthe followingcharacteristagged w.</Paragraph>
      <Paragraph position="9"> f. The preceding (following) character is tagged zandthecharactertwobefore(after)was tagged w.</Paragraph>
      <Paragraph position="10"> g.Thepreceding(following)characteris c.</Paragraph>
      <Paragraph position="11"> h.Thecharactertwobefore(after)is c.</Paragraph>
      <Paragraph position="12"> i. One of the two preceding (following) charactersis c.</Paragraph>
      <Paragraph position="13"> j. The current character is c and the  preceding(following)characterisx .</Paragraph>
      <Paragraph position="14"> k. The current character is c and the preceding(following)characteristagged z. where a, b, zand warevariablesoverthesetof fourtags(LL,RR,LR,MM) Therankedsetofruleslearnedinthistraining process will be applied to the output of the maximumentropytagger.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="111" end_page="111" type="metho">
    <SectionTitle>
3 Experimental results
</SectionTitle>
    <Paragraph position="0"> We conducted three experiments. In the first experiment, we used the maximum matching algorithmtoestablishabaseline,ascomparing results across different data sources can be difficult. This experiment is also designed to demonstrate that even with a relatively small number of new words in the testing data, the segmentation accuracy drops sharply. In the second experiment, we applied the maximum entropymodeltotheproblemofChineseword segmentation. The results will show that this approach alone outperforms the state-of-the-art resultsreportedinpreviousworkinsupervised machine-learning approaches. In the third experimentwecombinedthemaximumentropy model with the error-driven transformation-based model. We used the error-driven transformation-based model to learn a set of rules to correct the errors produced by the maximumentropymodel.Thedataweusedare fromthePennChineseTreebank(Xia etal. 2000, Xue et al . 2002) and they consist of Xinhua newswire articles. We took 250,389 words (426,292charactersor hanzi)worthofmanually segmented data and divided them into two chunks. The first chunk has 237,791 words (404,680 Chinese characters) and is used as training data. The second chunk has 12,598 words(21,612characters)andisusedastesting data. These data are used in all three of our experiments.</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.1 Experiment One
</SectionTitle>
      <Paragraph position="0"> In this experiment, we conducted two subexperiments. In the first sub-experiment, we</Paragraph>
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.2 Experiment Two
</SectionTitle>
      <Paragraph position="0"> In this experiment, a maximum entropy model was trained on a POC-tagged corpus derived from the training data described above. In the testing phase, the sentences in the testing data were first split into sequences of characters and then tagged by this maximum entropy tagger. The tagged testing data are then converted back into word segments for evaluation. Note that converting a POC-tagged corpus into a segmented corpus is not entirely straightforward when inconsistent tagging occurs. For example, it is possible that the tagger assigns an LL-LR sequence to two adjacent characters. We made no effort to ensure the best possible conversion.</Paragraph>
      <Paragraph position="1"> The character that is POC-tagged LL is invariably combined with the following character,nomatterhowitistagged.</Paragraph>
    </Section>
    <Section position="3" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.3 Experiment Three
</SectionTitle>
      <Paragraph position="0"> In this experiment, we used the maximum entropy model trained in experiment two to automaticallytagthetrainingdata.Thetraining accuracy of the maximum entropy model is 97.54% in terms of the number of characters  and correctedtesting data were converted into word segments. Again, no effort was made to optimize the segmentation accuracy during the conversion.</Paragraph>
    </Section>
    <Section position="4" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> In evaluating our model, we calculated both the tagging accuracy and the segmentation accuracy.</Paragraph>
      <Paragraph position="1"> The calculation of the tagging accuracy is straightforward.Itissimplythetotalnumberof correctlyPOC-taggedcharactersdividedbythe total number of characters. In evaluating segmentationaccuracy,weusedthreemeasures: precision,recallandbalancedF-score.Precision (p) is defined as the number of correctly segmentedwordsdividedbythetotalnumberof words in the automatically segmented corpus.</Paragraph>
      <Paragraph position="2">  accuracy drops to only 89.77% in F-score. In contrast,themaximumentropytaggerachieves an accuracy of 94.89% measured by the balancedF-scoreevenwhentherearenewwords in the testing data.Thisresultis only slightly lower than the 95.15% that the maximum matchingalgorithmachievedwhenthereareno new words. The transformation-based tagger improvesthetaggingaccuracyby0.12%from 95.95%to96.07%.Thesegmentationaccuracy jumps to 95.17% (F-score) from 94.89%, an increase of 0.28%. That fact that the improvementinsegmentationaccuracyishigher thantheimprovementintaggingaccuracyshows that the transformation-based tagger is able to correctsomeoftheinconsistenttaggingerrors producedbythemaximumentropytagger.This is clearly demonstrated in the five highest-ranked transformation rules learned by this model:  range between 89.4% and 98.6%, averaging 94.4%. Since the data are also from Xinhua newswire, some comparison can be made between our results and this model. With less  is more robust than the dictionary-based approaches. They also show that the present approach outperforms other state-of-the-art machine-learningmodels.Wecanalsoconclude thatthemaximumentropymodelisapromising supervisedmachinelearningalternativethatcan be effectively applied to Chinese word segmentation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>