File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/w02-1806_abstr.xml
Size: 4,579 bytes
Last Modified: 2025-10-06 13:42:41
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1806"> <Title>PCFG Parsing for Restricted Classical Chinese Texts</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The Probabilistic Context-Free Grammar (PCFG) model is widely used for parsing natural languages, including Modern Chinese. For Classical Chinese, however, computer processing is only just beginning.</Paragraph> <Paragraph position="1"> Our previous study on the part-of-speech (POS) tagging of Classical Chinese is a pioneering work in this area. In this paper, we move on to PCFG parsing of Classical Chinese texts. We continue to use the same tagset and corpus as in our previous study, and apply the bigram-based forward-backward algorithm to obtain the context-dependent probabilities. For the PCFG model, we restrict the rewriting rules to binary and unary rules, which simplifies the implementation. A small rule-set was developed that accounts for the grammatical phenomena occurring in the corpus. The texts are restricted in that the number of proper nouns and difficult characters is limited. In our preliminary experiments, the parser achieves a promising accuracy of 82.3%.</Paragraph> <Paragraph position="2"> Introduction Classical Chinese differs essentially from Modern Chinese, especially in syntax and morphology. While there have been a number of works on Modern Chinese processing over the past decade (Yao and Lua, 1998a), Classical Chinese has been largely neglected, mainly because of its obsolete and difficult grammar patterns. In our previous work (2002), however, we argued that in terms of computer processing, Classical Chinese is even easier, as there is no need for word segmentation, an unavoidable obstacle in the processing of Modern Chinese texts. In this paper, we move on to parsing Classical Chinese with a PCFG model.
In this section, we first briefly review related work, then provide the background of Classical Chinese processing, and finally outline the rest of the paper.</Paragraph> <Paragraph position="3"> A number of parsing methods have been developed in the past few decades. They can be roughly classified into two categories: rule-based approaches and statistical approaches. Typical rule-based approaches, as described in James (1995), are driven by grammar rules. Statistical approaches such as Yao and Lua (1998a), Klein and Manning (2001) and Johnson (2001), on the other hand, learn their parameters from the distributional regularities of a usually large corpus. In recent years, the statistical approaches have been more successful in both part-of-speech tagging and parsing. In this paper, we apply PCFG parsing with context-dependent probabilities.</Paragraph> <Paragraph position="4"> A special difficulty in Modern Chinese processing lies in word segmentation.</Paragraph> <Paragraph position="5"> Unlike in Indo-European languages, Modern Chinese words are written without white space marking the boundaries between adjacent words, and different possible segmentations may yield different meanings. In this sense, Modern Chinese is much more ambiguous than Indo-European languages and thus more difficult to process automatically (Huang et al., 2002).</Paragraph> <Paragraph position="6"> For Classical Chinese processing, such segmentation is largely unnecessary, since most Classical Chinese words consist of a single syllable written as a single character. In this respect it is easier than Modern Chinese. Classical Chinese is nevertheless even more ambiguous, because more than half of the words have two or more possible lexical categories, and dynamic shifts of lexical category are the most common grammatical phenomenon in Classical Chinese.
Despite these difficulties, our work (2002) on part-of-speech tagging has shown encouraging results.</Paragraph> <Paragraph position="7"> The rest of the paper is organized as follows. In Section 1, a tagset designed specially for Classical Chinese is introduced and the forward-backward algorithm for obtaining the context-dependent probabilities is briefly discussed. Section 2 briefly presents the traditional two-level PCFG model, the syntactic tagset and the CFG rule-set for Classical Chinese. Features of Classical Chinese grammar are also covered in that section. Section 3 presents our experimental results. A summary of the paper is given in the conclusion section.</Paragraph> </Section> </Paper>