<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1085">
  <Title>USE OF LEXICAL AND SYNTACTIC TECHNIQUES IN RECOGNIZING HANDWRITTEN TEXT</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. ISOLATED HANDWRITTEN
WORD RECOGNITION (WR)
</SectionTitle>
    <Paragraph position="0"> This research employs both off-line [3] and on-line word recognizers [4]. The actual WR is implemented as a three-stage procedure. In the first stage, wholistic features of the word are used to reduce the lexicon from 21,000 words to a much smaller set of candidates. (*This work was supported in part by NSF grant IRI-9315006.)</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="427" type="metho">
    <SectionTitle>
3. TRAINING CORPUS, LEXICON
</SectionTitle>
    <Paragraph position="0"> A database of representative text is crucial for this research.</Paragraph>
    <Paragraph position="1"> We are using an electronic corpus consisting of several thousand e-mail messages which is best categorized as intradepartmental communication (e.g., meeting notifications, requests for information, etc.). The style of language used in e-mail reflects that used in handwriting: informal, ungrammatical at times, relatively short sentences, etc. Such a training set has been collected and has been tagged using the Xerox POS tagger. We employ a 21,000 word lexicon derived from this e-mail corpus, which is represented as a trie to permit efficient access.</Paragraph>
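
As a concrete illustration of the trie representation, the following is a minimal sketch of prefix-based lexicon access; the class and method names are ours and this is not the authors' implementation.

```python
# Minimal sketch of a trie-backed lexicon (illustrative, not the authors' code).
# It supports exact membership tests and prefix enumeration, the operations a
# recognizer needs for efficient lexicon access.

class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False


class TrieLexicon:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def with_prefix(self, prefix):
        """Yield every lexicon entry that starts with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return
        stack = [(node, prefix)]
        while stack:
            n, s = stack.pop()
            if n.is_word:
                yield s
            for ch, child in n.children.items():
                stack.append((child, s + ch))


lexicon = TrieLexicon(["meeting", "meet", "mail", "most", "notification"])
print("mail" in lexicon)                    # True
print(sorted(lexicon.with_prefix("mee")))   # ['meet', 'meeting']
```
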
    <Paragraph position="2"> The simulator assumes perfect performance for the wholistic lexicon reduction stage; the actual module performs with better than 95% accuracy.</Paragraph>
    <Paragraph position="3"> [Figure: sample word recognizer output showing candidate word neighbourhoods; for some words the correct choice is not among the top choices.] </Paragraph>
  </Section>
  <Section position="6" start_page="427" end_page="428" type="metho">
    <SectionTitle>
4. LEXICAL ANALYSIS USING
COLLOCATIONAL INFORMATION
</SectionTitle>
    <Paragraph position="0"> This module applies collocational information [5] in order to modify the word neighbourhoods generated by the WR. These modified neighbourhoods are then input to a statistical syntax analysis module which makes the final word choices. Collocations are word patterns that occur frequently in language; intuitively, if word A is present, there is a high probability that word B is also present. We use Xtract to find collocations in a 2.1 million word portion of the Wall Street Journal (WSJ) corpus. (Due to the currently inadequate size of the e-mail corpus, we are temporarily conducting our experiments on the WSJ corpus.) Collocations are categorized based on (i) the strength of their association (mutual information score, mis) and (ii) the mean and variance of the separation between them. At this point we are considering only fixed collocations such as compound nouns (e.g., "computer scientist", "letter of intent") and lexico-syntactic collocations (e.g., "giving up"), which are characterized by low variance in their separation. In this training set, "significant" collocations occur at the rate of approximately 2.6 per sentence, making it worthwhile to perform collocational analysis.</Paragraph>
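
As a rough illustration of the statistics involved (not Xtract itself), the sketch below scores word pairs within a small window with a pointwise-mutual-information-style association score and keeps only pairs whose separation has low variance; the window size, thresholds, and function names are assumptions made for the example.

```python
# Illustrative collocation finder: pairs are ranked by a PMI-style association score
# (a stand-in for the mutual information score, mis) and filtered to "fixed"
# collocations whose separation has low variance. Thresholds are arbitrary.

import math
from collections import Counter
from statistics import mean, pvariance

def find_fixed_collocations(sentences, window=5, min_mis=3.0, max_var=1.0, min_count=3):
    unigrams = Counter()
    pair_offsets = {}                      # (w1, w2) -> list of separations
    total = 0
    for sent in sentences:
        total += len(sent)
        unigrams.update(sent)
        for i, w1 in enumerate(sent):
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                pair_offsets.setdefault((w1, sent[j]), []).append(j - i)

    results = []
    for (w1, w2), offsets in pair_offsets.items():
        if len(offsets) < min_count:
            continue
        p_pair = len(offsets) / total
        p1, p2 = unigrams[w1] / total, unigrams[w2] / total
        mis = math.log2(p_pair / (p1 * p2))          # association strength
        if mis >= min_mis and pvariance(offsets) <= max_var:
            results.append((w1, w2, mis, mean(offsets)))
    return sorted(results, key=lambda r: -r[2])
```
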
    <Paragraph position="1"> Specifically, collocational analysis can result in the following actions (ranked from conservative to aggressive): (i) re-rank the word choices thereby promoting more likely words, (ii) eliminate word choices thereby reducing word neighbourhoods, or (iii) propose new words (not in the top n choices of the WR). The first two actions are possible only if, for each word in the multi-word collocation, the correct word is among the top n choices of the WR. The last action does not have this restriction and constitutes a form of error detection and correction.</Paragraph>
    <Paragraph position="2"> Based on (i) the strength of the collocation that a word choice participates in, mis(xy), and (ii) the confidence given to this word by the WR, wr_conf(x), a decision is made whether to simply promote a word choice (i.e., increase its rank) or to promote it to the top choice and eliminate all other word choices for each of the word neighbourhoods participating in the collocation.</Paragraph>
    <Paragraph position="3"> We compute the lexically adjusted score of the word, las(x) = mis(xy) + wr_conf(x); if a word does not participate in any collocation with an adjacent word, its score remains the same. The word choices are then re-ranked based on any new scores. There are two special actions which are taken: 1. If one word in a collocation is promoted to top choice, the remaining words (if they are one of the top choices) are also promoted to the top choice.</Paragraph>
    <Paragraph position="4"> 2. If the confidences of word choices fall below a certain threshold t (based on the difference between them and the top choice), then they are eliminated from further consideration. This is illustrated in Figure 3.</Paragraph>
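
The following sketch shows one way the re-ranking, promotion and pruning described above could be wired together, assuming each neighbourhood is a confidence-ranked list of (word, wr_conf) pairs and collocations are supplied as a dictionary of mis scores; the data layout and the threshold value are illustrative, not taken from the paper.

```python
# Sketch of lexical re-ranking for two adjacent word neighbourhoods.
# las(x) = mis(xy) + wr_conf(x) for words participating in a collocation;
# other words keep their recognizer confidence. Choices far below the new
# top choice are pruned (special action 2). Values are illustrative.

def rerank_neighbourhoods(left, right, collocations, threshold=5.0):
    def adjusted(nbhd, other):
        scored = []
        for word, conf in nbhd:
            las = conf
            for other_word, _ in other:
                mis = collocations.get((word, other_word)) or collocations.get((other_word, word))
                if mis is not None:
                    las = max(las, conf + mis)   # lexically adjusted score
            scored.append((word, las))
        return sorted(scored, key=lambda p: -p[1])

    def prune(nbhd):
        top_score = nbhd[0][1]
        return [(w, s) for w, s in nbhd if top_score - s <= threshold]

    return prune(adjusted(left, right)), prune(adjusted(right, left))


left = [("mucus", 4.0), ("nucleus", 3.5)]
right = [("power", 5.0), ("paver", 2.0)]
collocations = {("nucleus", "power"): 6.2}
print(rerank_neighbourhoods(left, right, collocations))
```

Because both members of a collocation receive the same mis bonus, promoting one member toward the top also tends to promote its partner, which approximates special action 1 above.
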
    <Paragraph position="5"> Based on a test set of 1025 words from the WSJ, collocational analysis improved the percentage correct in the top choice from 67% to 72.5%. We are experimenting with various thresholds for deleting word choices in order to minimize the error. We are in the process of extending collocational analysis by using a one-sided information score. For example, the word 'offended' is frequently followed by the word 'by', but the word 'by' may be preceded by virtually anything.</Paragraph>
    <Paragraph position="6"> Such analysis extends the utility of collocational analysis but comes with a risk of promoting incorrect word choices.</Paragraph>
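
One plausible way to obtain such a one-sided score is to estimate, for each ordered word pair, the probability that the second word immediately follows the first; the sketch below uses simple relative frequencies over a toy corpus and is only meant to illustrate the asymmetry.

```python
# Directional (one-sided) association: how strongly does word x predict that
# word y follows it? Estimated as P(next = y | current = x) from bigram counts.
# The toy sentences are illustrative only.

from collections import Counter

def follower_model(sentences):
    follower, left_count = Counter(), Counter()
    for sent in sentences:
        for x, y in zip(sent, sent[1:]):
            follower[(x, y)] += 1
            left_count[x] += 1

    def p_follows(x, y):
        return follower[(x, y)] / left_count[x] if left_count[x] else 0.0

    return p_follows


sents = [["he", "was", "offended", "by", "it"],
         ["she", "seemed", "offended", "by", "the", "remark"],
         ["it", "was", "sent", "by", "mail"]]
p_follows = follower_model(sents)
print(p_follows("offended", "by"))   # high: 'offended' strongly predicts 'by'
print(p_follows("by", "the"))        # lower: 'by' is followed by many things
```
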
    <Paragraph position="7"> Action (iii), namely proposing new words, is based on the visually similar neighbourhood (VSN) of a word choice. The VSN of a word is computed by the same process that is used by the WR to reduce a lexicon based on wholistic properties of a word. In cases where the reduced lexicon is still too large (over 200 words), more stringent constraints (such as word length) are applied in order to reduce the size even further. The VSN is computed automatically from the ASCII representation of a word. For example, if the correct words are "nuclear power" and the set of word choices results in "nucleus power" and "mucus power", collocational analysis results in the additional word choice "nuclear". This is based on the fact that (i) "nuclear" is in the VSN of "nucleus" and (ii) the words "nuclear power" constitute a strong collocation. This method is currently being attempted for only a small set of strong collocations.</Paragraph>
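
A sketch of action (iii) under simplifying assumptions: the wholistic VSN computation is stubbed out with a crude string-similarity test, and strong collocations are supplied as a list of word pairs; the names and the similarity threshold are ours.

```python
# Sketch of proposing a new word from a strong collocation (action iii).
# The VSN test here is a crude string-similarity stand-in; the real VSN is
# computed from wholistic word features, which are not reproduced here.

from difflib import SequenceMatcher

def visually_similar(a, b, threshold=0.7):
    """Stand-in for VSN membership: plain string similarity."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def propose_new_words(neighbourhood, next_neighbourhood, strong_collocations):
    """Extra candidates for `neighbourhood` suggested by strong collocations."""
    proposals = []
    for first, second in strong_collocations:
        if second in next_neighbourhood and first not in neighbourhood:
            if any(visually_similar(first, choice) for choice in neighbourhood):
                proposals.append(first)
    return proposals


choices = ["nucleus", "mucus"]
next_choices = ["power", "paver"]
strong = [("nuclear", "power")]
print(propose_new_words(choices, next_choices, strong))   # ['nuclear']
```
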
  </Section>
  <Section position="7" start_page="428" end_page="429" type="metho">
    <SectionTitle>
5. SYNTACTIC MODELS: USING POS
TAGS TO REDUCE WORD
NEIGHBOURHOODS
</SectionTitle>
    <Paragraph position="0"> The performance of a WR system can be improved by incorporating statistical information at the word sequence level. The performance improvement derives from selecting lower-ranked words from the WR output when the surrounding context indicates that such a selection makes the entire sentence more probable. Given a set of output words X which emanate from a noisy channel (such as a WR), N-gram word models [6] seek to determine the string of words W which most probably gave rise to it. This amounts to finding the string W for which the a posteriori probability</Paragraph>
    <Paragraph position="1"> P(W | X) = P(X | W) P(W) / P(X) </Paragraph>
    <Paragraph position="2"> is maximum, where P(X | W) is the probability of observing X when W is the true word sequence, P(W) is the a priori probability of W and P(X) is the probability of string X. The values for each of the P(Xi | Wi) are known as the channel (or confusion) probabilities and can be estimated empirically. If we assume that words are generated by an nth order Markov source, then the a priori probability P(W) can be estimated as</Paragraph>
    <Paragraph position="3"> P(W) ≈ Π_k P(Wk | Wk-n, ..., Wk-1) </Paragraph>
    <Paragraph position="4"> where P(Wk | Wk-n, ..., Wk-1) is called the nth-order transitional probability. The Viterbi algorithm [7] is a dynamic programming method for finding an optimal solution to the above quantity.</Paragraph>
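
To make the scoring concrete, the toy sketch below scores candidate sentences drawn from the word neighbourhoods by channel probabilities times a bigram prior and returns the highest-scoring one; a real system would use the Viterbi algorithm instead of exhaustive enumeration, and every probability here is invented for illustration.

```python
# Toy noisy-channel scoring: score a candidate word sequence W for recognizer
# output X as P(X | W) * P(W), with P(W) approximated by a bigram model.
# Probabilities below are invented; real values would be estimated from data.

import math
from itertools import product

def sentence_log_score(candidate, observed, channel_logp, bigram_logp):
    """log P(X | W) + log P(W) under independence and bigram assumptions."""
    score = sum(channel_logp(x, w) for x, w in zip(observed, candidate))
    words = ["<s>"] + list(candidate)
    score += sum(bigram_logp(prev, w) for prev, w in zip(words, words[1:]))
    return score

def best_sentence(neighbourhoods, observed, channel_logp, bigram_logp):
    """Exhaustive search over the lattice (Viterbi avoids this enumeration)."""
    return max(product(*neighbourhoods),
               key=lambda cand: sentence_log_score(cand, observed, channel_logp, bigram_logp))


channel = lambda x, w: math.log(0.6 if x == w else 0.2)
bigrams = {("<s>", "the"): 0.5, ("the", "mail"): 0.4, ("the", "mall"): 0.05}
bigram = lambda a, b: math.log(bigrams.get((a, b), 0.01))

neighbourhoods = [["the", "tho"], ["mall", "mail"]]
observed = ["tho", "mall"]
print(best_sentence(neighbourhoods, observed, channel, bigram))   # ('the', 'mail')
```
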
    <Paragraph position="5"> The problem with such approaches is that as the number of words in the vocabulary grows, estimating the parameters reliably becomes difficult. More specifically, the number of low or zero-valued entries in the transition matrix starts to rise exponentially. [8] reports that of the 6.799 x 10^10 2-grams that could possibly occur in a 365,893,263 word corpus (consisting of 260,740 unique words), only 14,494,217 actually occurred, and of these, 8,045,024 occurred only once.</Paragraph>
    <Paragraph position="6"> In n-gram class models, words are mapped into syntactic classes [9]. In this situation, p(wt | wt-1) becomes: p(wt | wt-1) = p(wt | C(wt)) p(C(wt) | C(wt-1)) where p(C(wt) | C(wt-1)) is the probability of reaching the class C(wt) from the class C(wt-1) and p(wt | C(wt)) is the probability of the word wt among the words of the class C(wt).</Paragraph>
    <Paragraph position="7"> The research described here uses n-gram class models where part-of-speech (POS) tags are used to classify words. We use the notation A : B to indicate the case where word A has been assigned the tag B. For each sentence analyzed, we form a word:tag lattice representing all possible sentences for the set of word choices output by string matching (see Figure 4; the presence of the DT tag in the trellis is explained below). The problem is to find the best path(s) through this lattice. Computation of the best path requires the following information: (i) tag transition statistics, and (ii) word probabilities.</Paragraph>
    <Paragraph position="8"> Transition probabilities describe the likelihood of a tag following some preceding (sequence of) tag(s). These statistics are calculated during training as:</Paragraph>
    <Paragraph position="9"> P(tag_j | tag_i) = count(tag_i followed by tag_j) / count(tag_i) </Paragraph>
    <Paragraph position="10"> Beginning- and end-of-sentence markers are incorporated as tags themselves to obtain the necessary sentence-level information. Word probabilities are defined (and calculated during training) as:</Paragraph>
    <Paragraph position="11"> P(word | tag) = count(word assigned tag) / count(tag) </Paragraph>
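
Both statistics are simple relative frequencies, so they can be computed in a few lines from a tagged corpus; the sketch below assumes sentences of (word, tag) pairs and a tiny invented corpus, and is not the authors' code.

```python
# Relative-frequency estimation of tag-transition and word-given-tag probabilities
# from a tagged corpus, with sentence-boundary markers treated as tags.
# The two-sentence corpus is invented for illustration.

from collections import Counter

def train(tagged_sentences):
    tag_count, transition, emission = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        tag_count.update(tags)
        transition.update(zip(tags, tags[1:]))
        emission.update((w.lower(), t) for w, t in sent)

    p_trans = lambda t1, t2: transition[(t1, t2)] / tag_count[t1] if tag_count[t1] else 0.0
    p_word = lambda w, t: emission[(w.lower(), t)] / tag_count[t] if tag_count[t] else 0.0
    return p_trans, p_word


corpus = [[("the", "DT"), ("meeting", "NN"), ("starts", "VBZ"), ("now", "RB")],
          [("please", "UH"), ("send", "VB"), ("the", "DT"), ("code", "NN")]]
p_trans, p_word = train(corpus)
print(p_trans("DT", "NN"))   # 1.0 on this toy corpus
print(p_word("the", "DT"))   # 1.0 on this toy corpus
```
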
    <Paragraph position="12"> The above statistics have been computed for the e-mail corpus. The Xerox POS tagger [10] has been employed to tag the corpus; the tagset used is the Penn Treebank tagset. The advantage of the Xerox tagger is that it can be trained on an untagged corpus.</Paragraph>
    <Paragraph position="13"> The Viterbi algorithm is used to find the best Word:Tag sequence through the lattice, i.e., the maximal value of the following quantity:</Paragraph>
    <Paragraph position="14"> Π_i P(Wordi | Tagi) P(Tagi | Tagi-1) </Paragraph>
    <Paragraph position="15"> over all possible tag sequences T = Tag0, Tag1, ..., Tagn+1, where Tag0 and Tagn+1 are the beginning-of-sentence and end-of-sentence tags respectively. The Viterbi algorithm allows the best path to be selected without explicitly enumerating all possible tag sequences. A modification to this algorithm produces the best n sequences.</Paragraph>
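
A compact sketch of the Viterbi search over such a word:tag lattice is given below; it maximizes the product of word and tag-transition probabilities over paths delimited by the sentence-boundary tags. The toy probability tables stand in for the trained statistics, and all names are ours.

```python
# Viterbi search over a word:tag lattice: each position holds (word, tag) candidates,
# and the best path maximizes prod_i P(word_i | tag_i) * P(tag_i | tag_i-1),
# closed off by the end-of-sentence tag. Probability tables are toy values.

import math

def viterbi(lattice, p_trans, p_word):
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # trellis[i][state] = (best log score reaching `state`, previous state)
    trellis = [{("<s>", "<s>"): (0.0, None)}]
    for position in lattice:
        column = {}
        for word, tag in position:
            column[(word, tag)] = max(
                (score + logp(p_trans(pstate[1], tag)) + logp(p_word(word, tag)), pstate)
                for pstate, (score, _) in trellis[-1].items())
        trellis.append(column)

    # Transition into the end-of-sentence tag, then follow backpointers.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0] + logp(p_trans(s[1], "</s>")))
    path = []
    for column in reversed(trellis[1:]):
        path.append(state)
        state = column[state][1]
    return list(reversed(path))


trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.8, ("NN", "</s>"): 0.6}
words = {("the", "DT"): 0.9, ("meeting", "NN"): 0.4, ("tho", "UH"): 0.2}
p_trans = lambda t1, t2: trans.get((t1, t2), 0.0)
p_word = lambda w, t: words.get((w, t), 0.0)

lattice = [[("the", "DT"), ("tho", "UH")], [("meeting", "NN"), ("meting", "VBG")]]
print(viterbi(lattice, p_trans, p_word))   # [('the', 'DT'), ('meeting', 'NN')]
```
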
    <Paragraph position="16"> The lattice of Figure 4 demonstrates this procedure being used to derive the correct tag sequence even when the correct word ('the') was not output by the WR. The chosen path is illustrated in boldface. The values on the edges represent tag transition probabilities and the node values represent word probabilities. Analysis showed that the correct tag most frequently missing from the lattice was the DT (determiner) tag. Thus, the DT tag is automatically included in the lattice in all cases of short words (&lt; 4 characters) where it was not otherwise a candidate.</Paragraph>
    <Paragraph position="17"> A test set of 140 sentences from the e-mail corpus produced the results shown in Figure 5. The percentage of words correctly recognized as the top choice increased from 51% to 61% using this method; the ceiling is 70% because the correct word choice is absent from the WR output for the remaining cases. Furthermore, by eliminating all word choices that were not part of the top 20 sequences output by the Viterbi algorithm, a reduction in the average word neighbourhood of 56% (from 4.4 to 1.64 choices/word) was obtained with an error rate of only 3%. The latter is useful if a further language model is to be applied (e.g., semantic analysis), since fewer word choices, and therefore far fewer sentence possibilities, remain.</Paragraph>
    <Paragraph position="18"> While this method is effective in reducing word neighbourhood sizes, it does not seem to be effective in determining the correct/best sentence (the ultimate objective) or in providing feedback. We are investigating hybrid models (combining syntax and semantics) for achieving this.</Paragraph>
  </Section>
</Paper>