<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1010"> <Section position="4" start_page="66" end_page="66" type="metho"> <SectionTitle> DICTIONARY REPRESENTATION </SectionTitle> <Paragraph position="0"> The lexicon of basic words and stems is represented as a weighted finite-state transducer (WFST) (Pereira et al., 1994). Most transitions represent mappings between hanzi and pronunciations, and are costless. Transitions between orthographic words and their parts of speech are represented by ε-to-category transductions and a unigram cost (negative log probability) of that word, estimated from a 20M-hanzi training corpus; a portion of the WFST is given in Figure 1.² Besides dictionary words, the lexicon contains all hanzi in the Big 5 Chinese code, with their pronunciation(s), plus entries for other characters (e.g., Roman letters, numerals, special symbols).</Paragraph> <Paragraph position="1"> Given this dictionary representation, recognizing a single Chinese word involves representing the input as a finite-state acceptor (FSA) in which each arc is labeled with a single hanzi of the input. The left-restriction of the dictionary WFST with the input FSA contains all and only the (single) lexical entries corresponding to the input. This WFST includes the word costs on arcs transducing ε to category labels.</Paragraph> <Paragraph position="2"> ²The costs are actually for strings rather than words: we currently lack estimates for the words themselves. We assign the string cost to the lexical entry with the likeliest pronunciation, and a large cost to all other entries. Thus the adverb entry, with the commonest pronunciation jiang1, has cost 5.98, whereas the noun (nc) entry for the same string, with the rarer pronunciation jiang4, is assigned a high cost. Note also that the current model is zeroeth-order in that it uses only unigram costs. Higher-order models, e.g. bigram word models, could easily be incorporated into the present architecture if desired.</Paragraph> <Paragraph position="3"> Now, input sentences consist of one or more entries from the dictionary, and we can generalize the word recognition problem to the word segmentation problem by left-restricting the transitive closure of the dictionary with the input. The result of this left-restriction is a WFST that gives all and only the possible analyses of the input FSA into dictionary entries. In general we do not want all possible analyses but rather the best analysis. This is obtained by computing the least-cost path in the output WFST. The final stage of segmentation involves traversing the best path, collecting into words all sequences of hanzi delimited by part-of-speech-labeled arcs. Figure 2 shows an example of segmentation: the sentence "How do you say octopus in Japanese?" consists of four words, namely ri4-wen2 'Japanese', zhang1-yu2 'octopus', zen3-mo 'how', and shuo1 'say'. In this case ri4 is also a word (e.g. a common abbreviation for Japan), as are wen2-zhang1 'essay' and yu2 'fish', so there is (at least) one alternate analysis to be considered; a sketch of the least-cost computation follows below.</Paragraph> </Section>
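The least-cost path through the lattice of dictionary analyses can also be computed with ordinary dynamic programming rather than general WFST algebra. Below is a minimal sketch in that spirit; the dictionary entries, their costs, and the use of pinyin strings in place of hanzi are all illustrative assumptions, not the paper's actual data or toolkit.

```python
import math

# Toy unigram dictionary: word -> cost (negative log probability).
# Entries and costs are hypothetical, chosen so that the paper's example
# segmentation wins over the alternate ri4 / wen2-zhang1 / yu2 analysis.
DICT_COSTS = {
    "ri4": 6.5, "wen2": 7.0, "ri4-wen2": 5.0,
    "zhang1": 7.5, "yu2": 6.8, "zhang1-yu2": 5.5,
    "wen2-zhang1": 5.2, "zen3-mo": 4.0, "shuo1": 3.5,
}

def segment(hanzi, costs):
    """Least-cost segmentation of a hanzi sequence into dictionary words;
    equivalent to the cheapest path through the WFST obtained by
    left-restricting the dictionary's transitive closure with the input."""
    n = len(hanzi)
    best = [math.inf] * (n + 1)   # best[i]: least cost of analyzing hanzi[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            word = "-".join(hanzi[j:i])
            if word in costs and best[j] + costs[word] < best[i]:
                best[i] = best[j] + costs[word]
                back[i] = j
    # Traverse the best path, collecting hanzi into words.
    words, i = [], n
    while i > 0:
        words.append("-".join(hanzi[back[i]:i]))
        i = back[i]
    return list(reversed(words)), best[n]

sentence = ["ri4", "wen2", "zhang1", "yu2", "zen3", "mo", "shuo1"]
print(segment(sentence, DICT_COSTS))
# (['ri4-wen2', 'zhang1-yu2', 'zen3-mo', 'shuo1'], 18.0)
```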
<Section position="5" start_page="66" end_page="68" type="metho"> <SectionTitle> MORPHOLOGICAL ANALYSIS </SectionTitle> <Paragraph position="0"> The method just described segments dictionary words, but as noted there are several classes of words that should be handled that are not in the dictionary. One class comprises words derived by productive morphological processes, such as plural noun formation using the suffix men0. The morphological analysis itself can be handled using well-known techniques from finite-state morphology (Koskenniemi, 1983; Antworth, 1990; Tzoukermann and Liberman, 1990; Karttunen et al., 1992; Sproat, 1992): we represent the fact that men0 attaches to nouns by allowing ε-transitions from the final states of all noun entries to the initial state of the sub-WFST representing men0.</Paragraph> <Paragraph position="1"> However, for our purposes it is not sufficient to represent the morphological decomposition of, say, plural nouns: we also need an estimate of the cost of the resulting word. For derived words that occur in our corpus we can estimate these costs as we would the costs for an underived dictionary entry. So jiang4-men0 '(military) generals' occurs, and we estimate its cost at 15.02. But we also need an estimate of the probability of a non-occurring though possible plural form like nan2-gua1-men0 'pumpkins'. Here we use the Good-Turing estimate (Baayen, 1989; Church and Gale, 1991), whereby the aggregate probability of previously unseen members of a construction is estimated as $N_1/N$, where $N$ is the total number of observed tokens and $N_1$ is the number of types observed only once. For men0 this gives prob(unseen(men0) | men0), and to get the aggregate probability of novel men0-constructions in a corpus we multiply this by prob_text(men0) to get prob_text(unseen(men0)). Finally, to estimate the probability of the particular unseen word nan2-gua1-men0, we use the simple bigram backoff model $$\mathrm{cost}(\text{nan2-gua1-men0}) = \mathrm{cost}(\text{nan2-gua1}) + \mathrm{cost}_{\text{text}}(\mathit{unseen}(\text{men0})),$$ where cost(nan2-gua1) is computed in the obvious way.</Paragraph> <Paragraph position="2"> [Figure 2 caption fragment: the non-optimal analysis is shown with dotted lines in the bottom frame.]</Paragraph> <Paragraph position="3"> Figure 3 shows how this model is implemented as part of the dictionary WFST. There is a (costless) transition between the NC node and men0. The transition from men0 to a final state transduces ε to the grammatical tag, with cost cost_text(unseen(men0)). For the seen word jiang4-men0 'generals', there is an ε:nc transduction from jiang4 to the node preceding men0; this arc has cost cost(jiang4-men0) − cost_text(unseen(men0)), so that the cost of the whole path is the desired cost(jiang4-men0). This representation gives jiang4-men0 an appropriate morphological decomposition, preserving information that would be lost by simply listing it as an unanalyzed form. Note that the backoff model assumes a positive correlation between the frequency of a singular noun and that of its plural. An analysis of nouns that occur in both the singular and the plural in our database reveals that there is indeed a slight but significant positive correlation ($R^2 = 0.20$, $p < 0.005$). This suggests that the backoff model is as reasonable a model as we can use in the absence of further information about the expected cost of a plural form.</Paragraph> </Section>
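As a concrete illustration of the Good-Turing backoff above, the sketch below computes cost_text(unseen(men0)) from type counts and adds it to a stem cost. All counts, probabilities, and the natural-log cost units are hypothetical assumptions, not the paper's estimates.

```python
import math

def unseen_construction_cost(type_counts, prob_text_suffix):
    """Aggregate cost of unseen members of a construction such as men0 plurals.

    Good-Turing: the probability mass of unseen types is N1/N, where N is the
    number of observed tokens and N1 the number of types observed only once;
    multiplying by prob_text(men0) gives prob_text(unseen(men0)).
    """
    n = sum(type_counts.values())                        # N: observed tokens
    n1 = sum(1 for c in type_counts.values() if c == 1)  # N1: types seen once
    prob_text_unseen = (n1 / n) * prob_text_suffix
    return -math.log(prob_text_unseen)                   # cost = -log prob

def backoff_plural_cost(stem_cost, unseen_cost):
    """Bigram backoff: cost(stem+men0) = cost(stem) + cost_text(unseen(men0))."""
    return stem_cost + unseen_cost

# Hypothetical counts of men0-plural types observed in a corpus.
plural_counts = {"jiang4-men0": 12, "ren2-men0": 430, "hai2-zi5-men0": 1}
prob_text_men0 = 1e-4  # hypothetical prob_text(men0) estimated from text

u = unseen_construction_cost(plural_counts, prob_text_men0)
print(backoff_plural_cost(stem_cost=9.5, unseen_cost=u))  # cost(nan2-gua1-men0)
```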
<Section position="6" start_page="68" end_page="69" type="metho"> <SectionTitle> CHINESE PERSONAL NAMES </SectionTitle> <Paragraph position="0"> Full Chinese personal names are in one respect simple: they are always of the form FAMILY+GIVEN. The FAMILY name set is restricted: there are a few hundred single-hanzi FAMILY names, and about ten double-hanzi ones. GIVEN names are most commonly two hanzi long, occasionally one hanzi long: there are thus four possible name types. The difficulty is that GIVEN names can consist, in principle, of any hanzi or pair of hanzi, so the possible GIVEN names are limited only by the total number of hanzi, though some hanzi are certainly far more likely than others. For a sequence of hanzi that is a possible name, we wish to assign a probability to that sequence qua name. We use an estimate derived from (Chang et al., 1992). For example, given a potential name of the form F1 G1 G2, where F1 is a legal FAMILY name and G1 and G2 are each hanzi, we estimate the probability of that name as the product of: the probability of finding any name in text; the probability of F1 as a FAMILY name; the probability of the first hanzi of a double GIVEN name being G1; the probability of the second hanzi of a double GIVEN name being G2; and the probability of a name of the form SINGLE-FAMILY+DOUBLE-GIVEN. The first probability is estimated from a name count in a text database, whereas the last four probabilities are estimated from a large list of personal names.³ This model is easily incorporated into the segmenter by building a WFST restricting the names to the four licit types, with the costs on the arcs for any particular name summing to an estimate of the cost of that name. This WFST is then summed with the WFST implementing the dictionary and morphological rules, and the transitive closure of the resulting transducer is computed.</Paragraph> <Paragraph position="1"> ³We have two such lists, one containing about 17,000 full names, and another containing frequencies of hanzi in the various name positions, derived from a million names.</Paragraph> <Paragraph position="2"> There are two weaknesses in Chang et al.'s (1992) model, which we improve upon. First, the model assumes independence between the first and second hanzi of a double GIVEN name. Yet some hanzi are far more probable in women's names than they are in men's names, and there is a similar list of male-oriented hanzi: mixing hanzi from these two lists is generally less likely than would be predicted by the independence model. As a partial solution, for pairs of hanzi that co-occur sufficiently often in our name lists, we use the estimated bigram cost rather than the independence-based cost; a sketch of the resulting name model follows below.</Paragraph> <Paragraph position="3"> The second weakness is that Chang et al. (1992) assign a uniform small cost to unseen hanzi in GIVEN names; but we know that some unseen hanzi are merely accidentally missing, whereas others are missing for a reason, e.g. because they have a bad connotation. We can address this problem by first observing that for many hanzi, the general 'meaning' is indicated by the so-called 'semantic radical': hanzi that share the same radical share an easily identifiable structural component. Thus plant names share the GRASS radical, malady names share the SICKNESS radical, and ratlike animal names share the RAT radical. Some classes are better for names than others: in our corpora, many names are picked from the GRASS class, very few from the SICKNESS class, and none from the RAT class.</Paragraph>
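To make the product model and its bigram refinement concrete, here is a minimal sketch. Every probability table below is a hypothetical stand-in (pinyin strings stand in for hanzi), not the paper's name lists or estimates.

```python
import math

# Hypothetical, illustrative probabilities -- not the paper's estimates.
P_ANY_NAME = 0.01                            # prob. of finding any name in text
P_TYPE_SF_DG = 0.6                           # prob. of SINGLE-FAMILY+DOUBLE-GIVEN
P_FAMILY = {"wang2": 0.09, "xia4": 0.002}    # single-hanzi FAMILY names
P_GIVEN1 = {"mi3": 0.001, "wei3": 0.004}     # 1st hanzi of a double GIVEN name
P_GIVEN2 = {"er3": 0.002, "ming2": 0.005}    # 2nd hanzi of a double GIVEN name
P_GIVEN_BIGRAM = {("mi3", "er3"): 5e-5}      # pairs seen often enough in the lists

def name_cost(f1, g1, g2):
    """Cost (-log prob) of a candidate name F1 G1 G2 under the product model,
    using the bigram estimate for GIVEN pairs that co-occur in the name lists
    and falling back to the independence-based estimate otherwise."""
    if (g1, g2) in P_GIVEN_BIGRAM:
        p_given = P_GIVEN_BIGRAM[(g1, g2)]
    else:
        p_given = P_GIVEN1[g1] * P_GIVEN2[g2]  # independence assumption
    return -math.log(P_ANY_NAME * P_FAMILY[f1] * p_given * P_TYPE_SF_DG)

print(name_cost("xia4", "mi3", "er3"))       # e.g. the name xia4-mi3-er3
```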
<Paragraph position="4"> We can thus better predict the probability of an unseen hanzi occurring in a name by computing a within-class Good-Turing estimate for each radical class. Assuming that unseen hanzi within each class are equiprobable, their probabilities are given by the Good-Turing theorem as $$p^{cls} \approx \frac{E(N_1^{cls})}{N \cdot E(N_0^{cls})} \quad (1)$$ where $p^{cls}$ is the probability of one unseen hanzi in class $cls$, $E(N_1^{cls})$ is the expected number of hanzi in $cls$ seen once, $N$ is the total number of hanzi, and $E(N_0^{cls})$ is the expected number of unseen hanzi in class $cls$. The use of the Good-Turing equation presumes suitable estimates of the unknown expectations it requires. In the denominator, $N_0^{cls}$ is well measured by counting, and we replace the expectation by the observation. In the numerator, however, the counts $N_1^{cls}$ are quite irregular, including several zeros (e.g. RAT, none of whose members were seen). However, there is a strong relationship between $N_1^{cls}$ and the number of hanzi in the class, so for $E(N_1^{cls})$ we substitute a smooth $S(N_1^{cls})$ computed against the number of class elements; this smooth guarantees that no zeros are estimated. The final estimating equation is then $$p^{cls} \propto \frac{S(N_1^{cls})}{N \cdot N_0^{cls}}. \quad (2)$$ The total of all these class estimates was about 10% off from the Turing estimate $N_1/N$ for the probability of all unseen hanzi, and we renormalized the estimates so that they would sum to $N_1/N$.</Paragraph> <Paragraph position="5"> This class-based model gives reasonable results: for six radical classes, Table 1 gives the estimated cost of an unseen hanzi in the class occurring as the second hanzi in a double GIVEN name. Note that the good classes JADE, GOLD, and GRASS have lower costs than the bad classes SICKNESS, DEATH, and RAT, as desired.</Paragraph> </Section> <Section position="7" start_page="69" end_page="69" type="metho"> <SectionTitle> TRANSLITERATIONS OF FOREIGN WORDS </SectionTitle> <Paragraph position="0"> Foreign names are usually transliterated using hanzi whose sequential pronunciation mimics the source-language pronunciation of the name. Since foreign names can be of any length, and since their original pronunciations are effectively unlimited, the identification of such names is tricky. Fortunately, there are only a few hundred hanzi that are particularly common in transliterations; indeed, the commonest ones, such as ba1, er3, and a1, are often clear indicators that a sequence of hanzi containing them is foreign: even a name like xia4-mi3-er3 'Shamir', which is a legal Chinese personal name, retains a foreign flavor because of er3. As a first step towards modeling transliterated names, we have collected all hanzi occurring more than once in the roughly 750 foreign names in our dictionary, and we estimate the probability of occurrence of each hanzi in a transliteration ($p_{TN}(hanzi_i)$) using the maximum likelihood estimate. As with personal names, we also derive an estimate from text of the probability of finding a transliterated name of any kind ($P_{TN}$). Finally, we model the probability of a new transliterated name as the product of $P_{TN}$ and $p_{TN}(hanzi_i)$ for each $hanzi_i$ in the putative name.⁴ The foreign name model is implemented as a WFST, which is then summed with the WFST implementing the dictionary, morphological rules, and personal names; the transitive closure of the resulting machine is then computed. A sketch of the transliteration estimate follows below.</Paragraph> </Section>
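The sketch below illustrates the transliteration model under stated assumptions: the name list and $P_{TN}$ are hypothetical, pinyin strings stand in for hanzi, and normalizing the maximum likelihood estimate over only the hanzi seen more than once is one plausible reading of the restriction described above.

```python
import math
from collections import Counter

# Hypothetical transliterated-name list from the dictionary; these entries
# are illustrative, not the actual ~750-name data.
foreign_names = [
    ["xia4", "mi3", "er3"],   # 'Shamir'
    ["mi3", "xie1", "er3"],
    ["a1", "ba1", "xia4"],
    ["ba1", "er3"],
]

P_TN = 0.002  # hypothetical prob. of finding any transliterated name in text

# Keep hanzi occurring more than once, then take the maximum likelihood
# estimate of each one's probability of occurrence in a transliteration.
counts = Counter(h for name in foreign_names for h in name)
common = {h: c for h, c in counts.items() if c > 1}
total = sum(common.values())
p_tn_hanzi = {h: c / total for h, c in common.items()}

def transliterated_name_cost(hanzi_seq):
    """Cost of a putative transliterated name: -log of P_TN times the
    product of p_TN(hanzi_i) over each hanzi in the sequence."""
    p = P_TN
    for h in hanzi_seq:
        p *= p_tn_hanzi[h]  # KeyError for hanzi outside the common set (sketch)
    return -math.log(p)

print(transliterated_name_cost(["xia4", "mi3", "er3"]))
```

</Paper>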