<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1076">
<Title>One Tokenization per Source</Title>
<Section position="3" start_page="457" end_page="458" type="intro">
<SectionTitle> 2 Corpus Investigation </SectionTitle>
<Paragraph position="0"> This section reports a concrete corpus investigation aimed at validating the hypothesis.</Paragraph>
<Section position="1" start_page="457" end_page="457" type="sub_section">
<SectionTitle> 2.1 Data </SectionTitle>
<Paragraph position="0"> The two resources used in this study are the Chinese PH corpus (Guo 1993) and the Beihang dictionary (Liu and Liang 1989). The Chinese PH corpus is a collection of about 4 million morphemes of news articles from the single source of China's Xinhua News Agency in 1990 and 1991. The Beihang dictionary is a collection of about 50,000 word-like tokens, each of which occurs at least 5 times in a balanced collection of more than 20 million Chinese characters.</Paragraph>
<Paragraph position="1"> What is unique about the PH corpus is that all and only the unambiguous token boundaries with respect to the Beihang dictionary have been marked. For instance, if the English character string fundsandmoney were in the PH corpus, it would appear as fundsand/money, since the position between the characters d and m is an unambiguous token boundary with respect to a normal English dictionary, whereas fundsand could be either funds/and or fund/sand.</Paragraph>
<Paragraph position="2"> There are two types of fragments between adjacent unambiguous token boundaries: those which are dictionary entries as a whole, and those which are not.</Paragraph>
</Section>
<Section position="2" start_page="457" end_page="457" type="sub_section">
<SectionTitle> 2.2 Dictionary-Entry Fragments </SectionTitle>
<Paragraph position="0"> We manually tokenized in context each of the dictionary-entry fragments in the first 6,000 lines of the PH corpus. There are 6,700 different fragments, which cumulatively occur 46,635 times.</Paragraph>
<Paragraph position="1"> Among them, 14 fragments (Table 1, Column 1) realize different tokenizations in their 87 occurrences. 16 tokenization errors would be introduced if only the majority tokenization of each fragment were taken (Table 2).</Paragraph>
<Paragraph position="2"> Also listed in Table 1 are the numbers of times each fragment is tokenized as a single token (Column 2) or as a stream of multiple tokens (Column 3). For instance, the first fragment must be tokenized as a single token 17 times but only once as a token-pair.</Paragraph>
<Paragraph position="3"> [Table 1: fragments realizing different tokenizations in the PH corpus; the table's Chinese entries are not recoverable from this text.] In short, 0.21% of all the different dictionary-entry fragments, accounting for 0.19% of all the occurrences, have realized different tokenizations, and 0.03% tokenization errors would be introduced if forced to take one tokenization per fragment.</Paragraph>
</Section>
<Section position="3" start_page="457" end_page="457" type="sub_section">
<SectionTitle> 2.3 Non-Dictionary-Entry Fragments </SectionTitle>
<Paragraph position="0"> Similarly, we identified in the PH corpus all fragments that are not entries in the Beihang dictionary, and manually tokenized each of them in context. There are 14,984 different fragments, which cumulatively occur 49,308 times. Among them, only 35 fragments (Table 3) realize different tokenizations in their 137 occurrences. 39 tokenization errors would be introduced if only the majority tokenization of each fragment were taken (Table 4).</Paragraph>
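The summary percentages quoted for Sections 2.2 and 2.3 follow directly from the raw counts reported above. The short Python sketch below simply recomputes them from those counts; the variable names and the grouping of the counts into tuples are our own arrangement, not something given in the paper.

```python
# Recompute the summary percentages of Sections 2.2 and 2.3 from the
# raw counts reported in the text.
cases = {
    # name: (fragments with multiple tokenizations, distinct fragments,
    #        occurrences of those fragments, total occurrences,
    #        errors under a one-tokenization-per-fragment policy)
    "dictionary-entry fragments": (14, 6_700, 87, 46_635, 16),
    "non-dictionary-entry fragments": (35, 14_984, 137, 49_308, 39),
}

for name, (varied, distinct, varied_occ, total_occ, errors) in cases.items():
    print(name)
    print(f"  fragment types with multiple tokenizations: {varied / distinct:.2%}")
    print(f"  occurrences of those fragments:             {varied_occ / total_occ:.2%}")
    print(f"  forced-choice tokenization errors:          {errors / total_occ:.2%}")
```

Running this yields 0.21%, 0.19% and 0.03% for the dictionary-entry fragments, and 0.23%, 0.28% and 0.08% for the non-dictionary-entry fragments, matching the figures in the text.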
<Paragraph position="1"> In short, 0.23% of all the non-dictionary-entry fragments, accounting for 0.28% of all the occurrences, have realized different tokenizations, and 0.08% tokenization errors would be introduced if forced to take one tokenization per fragment.</Paragraph>
</Section>
<Section position="4" start_page="457" end_page="458" type="sub_section">
<SectionTitle> 2.4 Tokenization Criteria </SectionTitle>
<Paragraph position="0"> Some readers might question the reliability of the preceding results, because it is well known in the literature that both inter- and intra-judge tokenization consistency can hardly be better than 95% and can easily fall below 70% if the tokenization is guided solely by the intuition of human judges.</Paragraph>
<Paragraph position="1"> To ensure consistency, the manual tokenization reported in this paper was independently done twice under the following three criteria, applied in this order: (1) Dictionary Existence: The tokenization contains no non-dictionary-entry character fragment.</Paragraph>
<Paragraph position="2"> (2) Structural Consistency: The tokenization has no crossing brackets (Black, Garside and Leech 1993) with at least one correct and complete structural analysis of its underlying sentence.</Paragraph>
<Paragraph position="3"> (3) Maximum Tokenization: The tokenization is a critical tokenization (Guo 1997).</Paragraph>
<Paragraph position="4"> The basic idea behind these criteria is to regard sentence tokenization as a (shallow) type of (phrase-structure-like) morpho-syntactic parsing, which assigns a tree-like structure to a sentence. The tokenization of a sentence is taken to be the single-layer bracketing corresponding to the highest possible cross-section of the sentence tree, with each bracket a token in the dictionary.</Paragraph>
<Paragraph position="5"> Among the three criteria, both the criterion of dictionary existence and that of maximum tokenization are well defined, without any uncertainty, as long as the tokenization dictionary is specified.</Paragraph>
<Paragraph position="6"> However, the criterion of structural consistency is somewhat under-specified, since the same linguistic expression may receive different structural analyses under different grammatical theories and/or formalisms, and may be read differently by different people.</Paragraph>
<Paragraph position="7"> Fortunately, our tokenization practice has shown that this is not a problem when all the controversial fragments are carefully identified and their tokenizations under different grammar schools are deliberately categorized. Note that the emphasis here is not on producing a unique &quot;correct&quot; tokenization but on managing and minimizing tokenization inconsistency.</Paragraph>
</Section>
</Section>
</Paper>
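To make the criteria of Section 2.4 concrete, here is a minimal Python sketch of criterion (1) and of a simplified stand-in for criterion (3). Criterion (2) is omitted because it requires a syntactic analysis of the whole sentence, and "fewest tokens" is used only as a rough proxy for critical tokenization in the sense of Guo (1997); both simplifications, and the toy English dictionary, are ours rather than the paper's.

```python
# A minimal sketch of the tokenization criteria of Section 2.4, assuming the
# dictionary is given as a plain set of entry strings. Structural consistency
# (criterion 2) is not modelled; "fewest tokens" below is only a simplified
# stand-in for critical tokenization (Guo 1997).

def all_tokenizations(fragment, dictionary):
    """Enumerate every tokenization whose tokens are all dictionary entries
    (criterion 1: dictionary existence)."""
    if not fragment:
        return [[]]
    results = []
    for i in range(1, len(fragment) + 1):
        head = fragment[:i]
        if head in dictionary:
            for rest in all_tokenizations(fragment[i:], dictionary):
                results.append([head] + rest)
    return results

def maximum_tokenizations(fragment, dictionary):
    """Keep only the tokenizations with the fewest tokens, a rough proxy
    for criterion 3 (maximum / critical tokenization)."""
    candidates = all_tokenizations(fragment, dictionary)
    if not candidates:
        return []
    fewest = min(len(t) for t in candidates)
    return [t for t in candidates if len(t) == fewest]

# Toy English stand-in for the fundsandmoney example of Section 2.1.
toy_dictionary = {"fund", "funds", "sand", "and", "money"}
print(maximum_tokenizations("fundsand", toy_dictionary))
# -> [['fund', 'sand'], ['funds', 'and']]: both two-token analyses survive,
#    which is why such a fragment is left unsegmented in the PH corpus.
```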