<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1030">
  <Title>IMPROVING CHINESE TOKENIZATION WITH LINGUISTIC FILTERS ON STATISTICAL LEXICAL ACQUISITION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
HOW TO HANDLE DOUBLE STANDARDS
</SectionTitle>
    <Paragraph position="0"> Current evaluation practice favors overly optimistic accuracy estimates. Because partially-tokenized words are usually evaluated as being correctly tokenized, failures to tokenize unknown words can be overlooke d . For example, what makes ~l~JJ~ (yufin zhh j~n, a charity) a single word when iBJ~llJJ and are both legitimate words.'? One answer is that translating the partially-tokenized segments individually can yield &amp;quot;assistance gold&amp;quot; or &amp;quot;aid currency&amp;quot;, instead of the unquestionably correct &amp;quot;charity&amp;quot; or &amp;quot;charity fund&amp;quot;. Another answer is that a speech synthesizer should never pause between the two segments; otherwise ~g)J is taken as a verb and ~i~ as a surname, changing the meaning to &amp;quot;help Gold&amp;quot;. A blind evaluation paradigm is needed that accommodates disagreement between human judges, yet does not bias the judges to accept the computer's output too generously.</Paragraph>
    <Paragraph position="1"> We have devised a procedure called nk-blind that uses n blind judges' standards. The n judges each hand-segment the test sentences independently, before the algorithm is run.</Paragraph>
    <Paragraph position="2"> Then, the algorithm's output is compared against the judges'; for each segment produced by the algorithm, the segment is considered to be a correct token if at least k of the n judges agree. Thus, more than one segmentation may be considered correct if we set k such that k _&lt; \[~J. If k is set to 1, it is sufficient for any judge to sanction a segment. If k = n, all the judges must agree. Under the n/c-blind method a precision rate can be given under any chosen (n, k) setting.</Paragraph>
    <Paragraph position="3"> The experiments below were conducted with 100 pairs of sentences from the corpus containing between 2,000 and 2,600 words, sampled randomly with replacement. All results reported in Figure 1 give the precision rates for n = 8 judges with all values of k between 1 and n. Note the tendency of higher values of k to reduce precision estimates. The wide variance with different k (between 30% and 90%) underscores the importance of more rigorous evaluation methodology.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
EXPERIMENT I
</SectionTitle>
    <Paragraph position="0"> Tokenizing independently derived test data. The unknown word problem is now widely recognized, but we believe its severity is still greatly underestimated. As an &amp;quot;acid test&amp;quot;, we tokenized a corpus that was derived completely independently of the dictionary that our tokenizer is based on. We used a statistical dictionary-based tokenizer designed to be representative of current tokenizing approaches, which chooses the segmentation that maximizes the product of the individual words' probabilities. The baseline dictionary used by the tokenizer is the BDC dictionary (BDC 1992), containing 89,346 unique orthographic forms. The text, drawn from the HKUST English-Chinese Parallel Bilingual Corpus (Wu 1994), consists of transcripts from the parliamentary proceedings of the Hong Kong Legislative Council. Thus, the text can be expected to contain many references to subjects outside the domains under consideration by our dictionary's lexicographers in Taiwan. Regional usage differences are also to be expected.</Paragraph>
    <Paragraph position="1"> The results (see Figure 1) show accuracy rates far below the 90-99% range which is typically reported. Visual inspection of tokenized output showed that an overwhelming majority of the errors arose from missing dictionary entries. Tokenization performance on realistic unrestricted text is still seriously compromised.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="180" type="metho">
    <SectionTitle>
EXPERIMENT II
</SectionTitle>
    <Paragraph position="0"> Tokenization with statistical lexicon augmentation. To alleviate the unknown word problem, we next experimented with augmenting the tokenizer's dictionary using CXtract, a statistical tool that finds morpheme sequences likely to be Chinese words (Fung &amp; Wu 1994). In the earlier work we found CXtract to be a good generator of previously unknown lexical entries, so overall token recall was expected to improve. However, it was not clear whether the gain would outweigh errors introduced by the illegitimate lexical entries that CXtract also produces.</Paragraph>
    <Paragraph position="1"> The training corpus consisted of approximately 2 million Chinese characters drawn from the Chinese half of our bilingual corpus. The unsupervised training procedure is described in detail in Fung &amp; Wu (1994). The training suggested 6,650 candidate lexical entries. Of these, 2,040 were already present  in the dictionary, leaving 4,610 previously unknown new entries. null The same tokenization experiment was then run, using the augmented dictionary instead. The results shown in Figure 1 bear out our hypothesis that augmenting the lexicon with CXtract's statistically generated lexical entries would improve the overall precision, reducing error rates as much as 32.0% for k = 2.</Paragraph>
  </Section>
  <Section position="6" start_page="180" end_page="180" type="metho">
    <SectionTitle>
EXPERIMENT III
</SectionTitle>
    <Paragraph position="0"> Morphosyntactic filters for lexicon candidates. CXtract produces excellent recall but we wished to improve precision further. Ideally, the false candidates should be rejected by some automatic means, without eliminating valid lexical entries. To this end, we investigated a set of 34 simple filters based on linguistic principles. Space precludes a full listing; selected filters are discussed below.</Paragraph>
    <Paragraph position="1"> Our filters can be extremely inexpensive because CXtract's statistical criteria are already tuned for high precision. The filtering process first segments the candidate using the original dictionary, to identify the component words. It then applies morphological and syntactic constraints to eliminate (a) sequences that should remain multiple segments and (b) ill-formed sequences.</Paragraph>
    <Paragraph position="2"> Morphological constraints. The morphologically-based filters reject a hypothesized lexical entry if it matches any filtering pattern. The particular characters in these filters are usually classified either as morphological affixes, or as individual words. We reject any sequence with the affix on the wrong end (the special case of the genitive fl&amp;quot;,j (de) is considered below). Because morphemes such as the plural marker ~ (m6n) or the instance marker -3k (d) are suffixes, we can eliminate candidate sequences that begin with them. Similarly, we can reject sequences that end with the ordinal prefix (di) or the preverbial durative ~ (z/d).</Paragraph>
    <Paragraph position="3"> Filtering characters cannot be used if they are polysemous or homonymous and can participate in legitimate words in other uses. For example, the durative ~i~ (zhe) is not a good filter because the same character (with varying pronunciations) can be used to mean &amp;quot;apply&amp;quot;, &amp;quot;trick&amp;quot;, or &amp;quot;touch&amp;quot;, among others.</Paragraph>
    <Paragraph position="4"> Any candidate lexical entry is filtered if it contains the genitive/associative ~ (de). This includes, for example, both ill-formed boundary-crossing patterns like ~j~ (de w6i xitin, danger of), and phrases like ~:~\]~ (xiang gang de qifm tti, Hong Kong's future) which should properly be segmented ~:h~ fl'-,J ~J~,. In addition, because the compounding process does not involve two double-character words as frequently as other patterns, such sequences were rejected.</Paragraph>
    <Paragraph position="5"> Closed-class syntactic constraints. The closed-class filters operate on two distinct principles. Sequences ending with strongly prenominal or preverbial words are rejected, as are sequences beginning with postnominals and postverbials. A majority of the filtering patterns match correct syntactic units, including prepositional, conjunctive, modal, adverbial, and verb phrases. The rationale for rejecting such sequences is that these closed-class words do not satisfy the criteria for being bound into compounds, and just co-occur with some sequences by chance because of their high frequency.</Paragraph>
    <Paragraph position="6"> Results. The same tokenization experiment was run using the filtered augmented dictionary. The filters left 5,506 candidate lexical entries out of the original 6,650, of which 3,467 were previously unknown. Figure 1 shows significantly improved precision in every measurement except for a very slight drop with k = 8, with an error rate reduction of 49.4% at k : 2. Thus any loss in token recall due to the filters is outweighed by the gain in precision. This may be taken as indirect evidence that the loss in recall is not large.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML