<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2198"> <Title>WORD CLASS DISCOVERY FOR POSTPROCESSING CHINESE HANDWRITING RECOGNITION</Title> <Section position="3" start_page="1221" end_page="1221" type="metho"> <SectionTitle> 2. WORD CLASS DISCOVERY </SectionTitle> <Paragraph position="0"> We describe in this section the problem of corpus-based word class discovery and the simulated annealing approach tbr the problem.</Paragraph> <Section position="1" start_page="1221" end_page="1221" type="sub_section"> <SectionTitle> 2.1 The problem </SectionTitle> <Paragraph position="0"> Let T= Wl,W2, ...,wL be a text corpus with L words; V = vl, v~, ..., VNv be the vocabulary composed of the NV distinct words in T; and C = C1,C2,...,CNc be the set of classes, where NC is a predefined number of classes. The word class discovery problem can be for~ mulated as follows: Given V and C (with afixed NC), find a class assignment C/ from V to C which maximizes the estimated probability of T, \[~(T), according to a specific probabilistic language model.</Paragraph> <Paragraph position="1"> For a class bigram model, find C/ : V --+ C to maximize ~(T) = ~I/L=I p(wi IC/(wl))p(C/(wi)lC/(wi-1)))) Alternatively, perplexity (Jardino an d Adda, 1993) or average mutual information (Brown et al., 1992) can be used as the characteristic value for optimization.</Paragraph> <Paragraph position="2"> Perplexity, PP, is a well-known quality metric for language models in speech recognition: PP = /5(T)-~.</Paragraph> <Paragraph position="3"> The perplexity for a class bigram model is:</Paragraph> <Paragraph position="5"> where wj is the j-th word in the text and ~b(wj) is the class that wj is assigned to.</Paragraph> <Paragraph position="6"> For class N-gram models with fixed NC, lower perplexity indicates better class assignment of the words. The word class discovery problem is thus defined: find the class assignment of the words to minimize the perplexity of the training text.</Paragraph> </Section> <Section position="2" start_page="1221" end_page="1221" type="sub_section"> <SectionTitle> 2.2 The simulated annealing approach </SectionTitle> <Paragraph position="0"> The word class discovery problem can be considered as a combinatorial optimization problem to be solved with a simulated annealing approach. Jardino and Adda (1993) used the approach for antomatically classifying French and German words. The four components (Kirkpatrick et al,, 1983) of a simulated annealing algorithm are (1) a specification of configuration, (2) a random move generator for rearrangements of the elements in a configuration, (3) a cost tim(:lion for evaluating a configuration, (4) an annealing s('hedule that specifies time and duration to decrease the control parameter (or temperature). The configuration is clearly the class assignment q~, for the word class discovery problem. The move generator is also straightforward -- randomly choosing a word to be re-assigned to a randomly chosen class. Perplexity can serve as the cost fimction to evaluate the quality of word classification. The Metropolis algorithm specifies the annealing schedule. 
<Section position="4" start_page="1221" end_page="1222" type="metho">
<SectionTitle> 3. CONTEXTUAL POSTPROCESSING OF HANDWRITING RECOGNITION </SectionTitle>
<Paragraph position="0"> The problem of contextual postprocessing can be described as follows: the character recognizer produces the top K candidates (with similarity scores) for each character in the input stream; the postprocessor then decides which of the K candidates is correct based on the context and a language model. Let the recognizer produce the candidate matrix M for the input sequence of characters, with one set of K candidates per input character.</Paragraph>
<Paragraph position="1"> The task of the postprocessor is to find the character combination O drawn from M with the highest probability according to the language model: O* = argmax_O P(O|M).</Paragraph>
<Paragraph position="2"> The overall probability can be divided into two parts, a pattern recognition probability and a linguistic probability: P(O|M) = P_PR(O|M) * P_LM(O|M). The former is produced by the recognizer, while the latter is defined by the language model.</Paragraph>
<Paragraph position="3"> This problem can be reformulated as one of finding the optimal path in a word lattice, since the word is the smallest meaningful unit in the Chinese language. The word lattice is formed with the words proposed by a word hypothesizer, which is composed of a dictionary matcher and some lexical rules. Thus, P_LM(O|M) = max over all paths of P(path), where a path is a word sequence formed by a character combination in M.</Paragraph>
<Section position="1" start_page="1221" end_page="1222" type="sub_section">
<SectionTitle> 3.1 Least-word model (LW) </SectionTitle>
<Paragraph position="0"> A simple language model is based on a dictionary (actually a word list). The characteristic function of the model is the number of words in the word-lattice path.</Paragraph>
<Paragraph position="1"> The best path is simply the one with the smallest number of words: P_LM(O|M) = (-1) * #words-in-the-path. This is similar to the principle of Maximum Matching in Chinese word segmentation.</Paragraph>
</Section>
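As an illustration of the word-lattice formulation and the least-word criterion, here is a small sketch. It is a toy example under simplifying assumptions: single letters stand in for Chinese characters, an exhaustive enumeration of paths replaces the paper's word hypothesizer and lexical rules, recognizer similarity scores are ignored, and all names are invented for the example.

```python
def lattice_paths(matrix, dictionary, max_word_len=4):
    """Enumerate word-lattice paths over a candidate matrix.

    matrix[i] is the list of top-K recognizer candidates for position i;
    a path is a sequence of dictionary words whose concatenated characters
    take exactly one candidate from each successive position."""
    n = len(matrix)

    def extend(i):
        if i == n:            # consumed the whole input: one complete path
            yield []
            return
        for word in dictionary:
            length = len(word)
            if length > min(max_word_len, n - i):
                continue
            # the word is admissible if each of its characters appears among
            # the candidates at the corresponding position
            if all(word[k] in matrix[i + k] for k in range(length)):
                for rest in extend(i + length):
                    yield [word] + rest

    return list(extend(0))

def least_word_path(matrix, dictionary):
    """Least-word model: the best path is the one with the fewest words."""
    paths = lattice_paths(matrix, dictionary)
    return min(paths, key=len) if paths else None

if __name__ == "__main__":
    # top-2 candidates per input "character" (letters stand in for hanzi)
    matrix = [["a", "o"], ["b", "d"], ["c", "e"]]
    dictionary = {"ab", "abc", "c", "ob", "de"}
    print(least_word_path(matrix, dictionary))   # ['abc']
```

The word-frequency, inter-word character bigram, and discovered class bigram models described next keep the same lattice and change only how a path is scored, replacing the word count by a product of probabilities.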
<Section position="2" start_page="1222" end_page="1222" type="sub_section">
<SectionTitle> 3.2 Word-frequency model (WF) </SectionTitle>
<Paragraph position="0"> Another simple model is based on the frequencies of the words in the word-lattice path. This can be considered as a word unigram language model. The path probability is the product of the word probabilities of the words in the path.</Paragraph>
</Section>
<Section position="3" start_page="1222" end_page="1222" type="sub_section">
<SectionTitle> 3.3 Inter-word character bigram model (IWCB) </SectionTitle>
<Paragraph position="0"> Lee et al. (1993) recently presented a novel idea, the word-lattice-based Chinese character bigram, for Chinese language modeling. Basically, they approximate the effect of word bigrams by applying character bigrams to the boundary characters of adjacent words. The approach is simple and very effective. It can also be considered as a class-based bigram model, using as morphological features the first and last characters of a word. We have implemented a variation of the model, called the inter-word character bigram model.</Paragraph>
<Paragraph position="1"> Word probabilities and Chinese character bigrams were built from the 10-million-character UD corpus. The path probability is computed as the product of the word probabilities and the inter-word character bigram probabilities of the words in the path. This model is one of the best among existing Chinese language models and has been successfully applied to Chinese homophone disambiguation and linguistic decoding (Lee et al., 1993).</Paragraph>
</Section>
<Section position="4" start_page="1222" end_page="1222" type="sub_section">
<SectionTitle> 3.4 Discovered class bigram model </SectionTitle>
<Paragraph position="0"> Our novel language model uses the word classes discovered by the simulated annealing procedure as the basis of a class bigram language model. The number of classes (NC) can be selected according to the size of the training corpus.</Paragraph>
<Paragraph position="1"> Every word in the training corpus is assigned to a certain class after the training process converges to a minimal perplexity. Thus, we can store the class indices in the corresponding lexical entries of the dictionary. Words in a word-lattice path are then automatically mapped to their class indices through dictionary look-up. The path probability is thus the product of lexical probabilities and contextual class bigram probabilities, as in a usual class bigram language model.</Paragraph>
</Section>
</Section>
</Paper>