<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2156"> <Title>Decision-Tree based Error Correction for Statistical Phrase Break Prediction in Korean *</Title> <Section position="3" start_page="0" end_page="1051" type="metho"> <SectionTitle> 2 Features of Korean </SectionTitle> <Paragraph position="0"> This section brMly explains the linguistic characterists of spoken Korean before describing the phrase break prediction.</Paragraph> <Paragraph position="1"> 1) A Korean word consists of more than one morpheme with clear-cut morphenm boundaries (Korean is all agglutinative language). 2) Korean is a postpositional language with many kinds of noun-endings, verb-endings, and prefinal verb-endings. These functional morphemes determine a noun's case roles, a verb's tenses, modals, and modification relations betwcen words. 3) Korean is basically an SOV language but has relatively free word order compared to other rigid word-order languages such as English ,except tbr the constraints that the verb must appear in a sentence-final position. However, in Korean, some word-order constraints actually do exist such that the auxiliary verbs representing modalities must follow the main verb, and modifiers must be placed betbre the word (called head) they modify. 4) Phonological changes can occur in a morpheme, between morphemes in a word, and even between words in a phrase, but not between phrases.</Paragraph> </Section> <Section position="4" start_page="1051" end_page="1052" type="metho"> <SectionTitle> 3 Hybrid Phrase Break Detection </SectionTitle> <Paragraph position="0"> Part-of speech (POS) tagging is a basic step to phrase break prediction. POS tagging systems have to handle out-of vocabulary (OOV) words for an unlimited vocabulary TTS system. Figure 1 shows the architecture of our phrase break predictor integrated with the POS tagging system. The POS tagging system eraploys generalized OOV word handling mechanisms in the morphological analysis and cascades statistical and rule-based approaches in the two-phase training architecture tbr POS disambiguation. null Tire probabilistic phrase break predictor segments the POS sequences into several phrases according to word trigram probabilities. Tire irdtial phrase break tagged morpheme sequence is corrected with the error correcting tree learned by the C4.5 (Quinlan, 1983).</Paragraph> <Paragraph position="1"> Tire next two subsections will give detailed descriptions of the probabilistic phrase prediction and error correcting tree learning. The hybrid POS tagging system will not l)e explained in this paper, and the interested readers can see (Cha et al., 1998) tbr further reference.</Paragraph> <Section position="1" start_page="1051" end_page="1052" type="sub_section"> <SectionTitle> 3.1 Probabilistie Phrase Break Detection </SectionTitle> <Paragraph position="0"> For phrase break prediction, we develop tire word POS tag trigrmn model. Some experiments are performed on all the possible trigram sequences and 'word-tag word-tag break wordtag' sequence turns out to be the most fl'uitful of any others, which are the same results as the previous studies in English (Sanders, 1995).</Paragraph> <Paragraph position="1"> The probability of a phrase break bi appearing after the second word POS tag is given by</Paragraph> <Paragraph position="3"> where C is a frequency count flmction and b0, bl and b2 mean no break, minor break and major break, respectively. 
<Paragraph position="2"> Even with a large number of training patterns, it is clear that some word POS tag sequences will never occur, or will occur only once, in the training corpus. One solution to this data sparseness problem is to smooth the probabilities with the bigram and unigram probabilities, which adjusts the frequency counts of rare or non-occurring POS tag sequences. We use the smoothed probabilities

P_smoothed(b_i | t_1 t_2 t_3) = λ_1 P(b_i | t_1 t_2 t_3) + λ_2 P(b_i | t_2 t_3) + λ_3 P(b_i | t_2)

where λ_1, λ_2, and λ_3 are three nonnegative constants such that λ_1 + λ_2 + λ_3 = 1. In our experiments, the weights λ_1, λ_2, and λ_3 were estimated as 0.2, 0.7, and 0.1, respectively.</Paragraph>
<Paragraph position="3"> Previous research on phrase break prediction mainly used the content-function word rule, whereby a phrase break is placed before every function word that follows a content word (Allen and Hunnicutt, 1987; Taylor et al., 1991). That work used a tag set of size only 3: function, content, and punctuation. Korean, however, is a postpositional agglutinative language. If the content-function word rule is to be adapted to Korean, it must be changed so that a phrase break is placed before every content morpheme that follows a function morpheme. Unfortunately, this rule is very inefficient in Korean, since it tends to create too many pauses. In our work, only the POS tags of function morphemes are used, because function morphemes constrain the classes of the preceding morphemes and play important roles in syntactic relations. Each word is therefore represented by the POS tag of its function morpheme; for a word that has no function morpheme, a simplified POS tag of its content morpheme is used. The number of POS tags used in this research is 32.</Paragraph>
</Section>
<Section position="2" start_page="1052" end_page="1052" type="sub_section">
<SectionTitle> 3.2 Decision-Tree Based Error Correction </SectionTitle>
<Paragraph position="0"> The probabilistic phrase break prediction covers only a limited range of contextual information, i.e., two preceding words and one following word. Moreover, the module cannot use the morpheme tags selectively, nor the relative distance to the other phrase breaks. For this reason, we designed an error-correcting tree to compensate for the limitations of the probabilistic phrase break prediction. Designing error-correcting rules by knowledge engineering, however, is tedious and error-prone. Instead, we adopted a decision tree learning approach to automatically learn the error-correcting rules from a correctly phrase-break-tagged corpus.</Paragraph>
<Paragraph position="1"> Most algorithms that have been developed for building decision trees employ a top-down, greedy search through the space of possible decision trees (Mitchell, 1997). C4.5 (Quinlan, 1983) is well suited to building a decision tree that successively divides the feature vector space so as to minimize the prediction error. It also uses information gain, which measures how well a given attribute separates the training vectors according to their target classification, to select the most critical attribute at each step while growing the tree (hence the name IGTree). We use it here to correct the initial phrase-break-tagged POS tag sequences generated by the probabilistic predictor. In doing so, we use the decision tree in a novel way, as a form of transformation-based rule induction (Brill, 1992).</Paragraph>
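As a concrete illustration of this training setup, the sketch below uses scikit-learn's DecisionTreeClassifier (a CART implementation, likewise grown top-down and greedily; criterion="entropy" mirrors the information-gain criterion) as a stand-in, since C4.5 itself is not a standard Python library. The window size, padding tokens, tag names, and feature layout are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

def windows(tags, pred_breaks, w=2):
    """One feature vector per word: the POS tags and the initially
    predicted breaks in a +/-w window around it (padding assumed)."""
    pad_t = ["BOS"] * w + tags + ["EOS"] * w
    pad_b = ["b0"] * w + pred_breaks + ["b0"] * w
    return [pad_t[i:i + 2 * w + 1] + pad_b[i:i + 2 * w + 1]
            for i in range(len(tags))]

vocab = {}
def encode(rows):
    """Map symbolic features to integer ids (the classifier needs numbers)."""
    return [[vocab.setdefault(v, len(vocab)) for v in row] for row in rows]

# Training pairs: attributes come from the *initial* (probabilistic) break
# tagging, classes from the *correct* break tagging of the same sentence.
tags        = ["NP", "JKS", "VV", "EF"]  # illustrative tag names
pred_breaks = ["b0", "b0", "b0", "b2"]   # trigram predictor output
gold_breaks = ["b0", "b1", "b0", "b2"]   # hand-labelled breaks

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(encode(windows(tags, pred_breaks)), gold_breaks)
```

Note one simplification: unlike C4.5, CART treats the integer ids as ordered values; a one-hot encoding would be closer to C4.5's categorical splits but is omitted here for brevity.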
<Paragraph position="2"> Figure 2 shows the tree learning architecture for phrase break error correction. The initial phrase-break-tagged POS tag sequences supply the feature vectors of attributes used for decision making. Because the feature vectors include phrase break sequences as well as POS tag sequences, the learned decision tree can check the morpheme tags selectively and exploit the relative distance to the other phrase breaks. The correctly phrase-break-tagged POS tag sequences supply the classes into which the feature vectors are classified. C4.5 builds a decision tree from the pairs consisting of the feature vectors and their classes.</Paragraph>
</Section>
</Section>
</Paper>
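Putting the two stages together, the following sketch reuses p_break, windows, encode, and tree from the snippets above to show one plausible end-to-end flow. The per-position argmax decoding is an assumption; the paper does not spell out the segmentation procedure.

```python
def predict_breaks(tags):
    """Hybrid prediction sketch: choose the most probable break after each
    tag trigram, then let the learned tree revise the initial tagging."""
    breaks = ["b0"] * len(tags)
    for i in range(1, len(tags) - 1):
        t1, t2, t3 = tags[i - 1], tags[i], tags[i + 1]
        breaks[i] = max(("b0", "b1", "b2"),
                        key=lambda b: p_break(b, t1, t2, t3))
    # Error correction pass: the tree maps each word's feature window
    # (POS tags + initial breaks) to a corrected break label.
    return list(tree.predict(encode(windows(tags, breaks))))

print(predict_breaks(["NP", "JKS", "VV", "EF"]))
```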