<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2184">
  <Title>Segmentation Standard for Chinese Natural Language Processing</Title>
  <Section position="1" start_page="0" end_page="1045" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper proposes a segmentation standard for Chinese natural language processing. The standard is proposed to achieve linguistic felicity, computational feasibility, and data uniformity. Linguistic felicity is maintained by defining a segmentation unit to be equivalent to the theoretical definition of word, and by providing a set of segmentation principles that are equivalent to a functional definition of a word. Computational feasibility is ensured by the fact that the above functional definitions are procedural in nature and can be converted to segmentation algorithms, as well as by the implementable heuristic guidelines which deal with specific linguistic categories. Data uniformity is achieved by stratification of the standard itself and by defining a standard lexicon as part of the segmentation standard.</Paragraph>
    <Paragraph position="1"> I. Introduction One important feature of Chinese texts is that they are character-based, not wordbased. Each Chinese character stands for one phonological syllable and in most cases represents a morpheme. The fact that Chinese writing does not mark word boundaries poses the unique question of word segmentation in Chinese computational linguistics (e.g. Sproat and Shih 1990, and Chert and Liu 1992).</Paragraph>
    <Paragraph position="2"> Since words are the linguistically significant basic elements that are entered in the lexicon and manipulated by grammar rules, no language processing can be done unless words are identified. In theoretical terms, the primacy of the concept of word can be more firmly established if its existence can be empirically supported in a language that does not mark it conventionally in texts (e.g. Bates et al. 1993, Huang et al. 1993). In computational terms, no serious Chinese language processing can be done without segmentation. No efficient sharing of electronic resources or computational tools is possible unless segmentation can be standardized.</Paragraph>
    <Paragraph position="3"> Evaluation, and thus comparisons and improvements, are also impossible in Chinese computational linguistics without standardized segmentation.</Paragraph>
    <Paragraph position="4"> Since the proposed segmentation standard is intended for Chinese natural language processing, it is very important that it reflects linguistic reality as well as computational applicability. Hence we stipulate that the proposed standard must be linguistically felicitous, computationally feasible, and must ensure data uniformity.</Paragraph>
    <Paragraph position="5"> 1.1.Components of the Sezmentation Standard Our proposed segmentation standard consists of two major components to meet the goals discussed above. The modularization of the components will facilitate revisions and maintenance in the future. The two major components of the segmentation standards are the segmentation criteria and the (standard) lexicon. The tripartite segmentation criteria consist of a definition of the segmentation unit, two segmentation principles, and a set of heuristic guidelines. The segmentation lexicon contains a list of Mandarin Chinese words and other linguistic units that the heuristic guidelines must refer to.</Paragraph>
    <Paragraph position="6">  words as 'the smallest units of speech that can meaningfully stand by their own,' they are natural units for segmentation in language processing. However, as Chao (1968) observes, sociological words and linguistic words very often do not match up. In English, a sociological word can be defined by the delimitation of blanks in writing. It is nevertheless possible for a linguistic word such as a compound to be composed of more than one sociological words, such as 'the White House.' Since these cases represent only a relatively small portion of English  texts, sociological words are taken as the default standard for segmentation units as well as a reasonable approximation to linguistic words in English language processing. Chinese, on the other hand, defines its sociological words in terms of characters, in spite of the fact that grammatical words may be made up of one or more characters. In fact, one-character words represent slightly less than 10% of all lexical entries, while two-character words take up more than 65%.</Paragraph>
    <Paragraph position="7"> Similarly, one-character words are estimated to take up only 50% of all texts in Chinese (Chen et al., 1993). Since the notion of the one-word-per-character sociological word is not a good working hypothesis for linguistic words, and since there is no fixed length for words, a crucial issue is whether the notion of linguistic words can be directly used as the standard for segmentation unit.</Paragraph>
    <Paragraph position="8"> Computational linguistic works suggest that linguistic words are not the perfect units for natural language processing. For instance, the necessity for lemmatization attests to the fact that some linguistically dependent units may have independent grammatical function and meaning and need to be treated as basic units in language processing (e.g. Sproat 1992). We follow the above findings and define the standard segmentation unit as a close approximation of linguistic words with emphasis on functional rather than phonological or morphological independency.</Paragraph>
    <Paragraph position="9"> 1) Segmentation Unitde f is the smallest string of character(s) that has both an independent meaning and a fixed grammatical category.</Paragraph>
    <Paragraph position="10"> There are two points worth remarking involving the above definition. First, non-technical terms are deliberately chosen such that even developers in information industries with little or no linguistic background could follow this standard. Second, it follows from this definition that many of the so-called particles, which show various levels of linguistic dependencies but represent invariant grammatical functions, will be treated as segmentation units. They include le 'perfective marker', and de 'relative clause marker'. II, 2, Segmentatign Principles We propose two segmentation principles to define the two basic concepts underlining the definition: independent meaning and fixed grammatical category. The principles also provide a functional/procedural algorithm for identifying segmentation units.</Paragraph>
  </Section>
class="xml-element"></Paper>