File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-1806_metho.xml
Size: 5,924 bytes
Last Modified: 2025-10-06 14:08:10
<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1806">
<Title>PCFG Parsing for Restricted Classical Chinese Texts</Title>
<Section position="3" start_page="10" end_page="10" type="metho">
<SectionTitle> 2 PCFG Model and Classical Chinese Grammar </SectionTitle>
<Paragraph position="0"> In this section we cover the PCFG model and the context-sensitive rules designed for Classical Chinese. Features of the rule set are also discussed.</Paragraph>
<Section position="1" start_page="10" end_page="10" type="sub_section">
<SectionTitle> 2.1 PCFG Model and Rule Restriction </SectionTitle>
<Paragraph position="0"/>
<Paragraph position="2"> start non-terminal, and R is the finite set of rules, which are pairs from
The advantage of binary/unary rules lies in the simplicity of the parsing algorithm, which is discussed in Section 4.</Paragraph>
<Paragraph position="3"> The major difference between our model and CNF is that we do not require the right-hand side of a unary rule to be a terminal. This makes the Classical Chinese language easier to represent.</Paragraph>
</Section>
<Section position="2" start_page="10" end_page="10" type="sub_section">
<SectionTitle> 2.2 Rule-Set for Classical Chinese </SectionTitle>
<Paragraph position="0"> An important advantage of PCFG is that it needs fewer rules and parameters. Judging from our corpus, which is representative of Classical Chinese classics, only 100-150 rules are sufficient. This is mainly because our rule set is linguistically sound. A summary of the rule set follows.</Paragraph>
<Paragraph position="1"> Table 2. Our non-terminals (also called the syntactic tagset or constituent set). A subset of the most frequently used rules is shown in the following table.</Paragraph>
</Section>
<Section position="3" start_page="10" end_page="10" type="sub_section">
<SectionTitle> Classical Chinese </SectionTitle>
<Paragraph position="0">
1. S -> NP VP ; simple S/V
2. S -> VP ; S omitted
3. S -> VP NP ; S/V inversion
4. S -> ad S
5. VP -> vi
6. VP -> vt NP ; simple V/O
7. VP -> NP vt ; V/O inversion
8. VP -> ad VP
9. VP -> PP VP ; prepositioned PP
10. VP -> VP PP ; postpositioned PP
11. VP -> NP ; NP as VP
12. VP -> VP yq
13. NP -> n
14. NP -> npron
15. NP -> ADJP NP
16. NP -> POSTADJP
17. NP -> VP ; V/O as NP
18. NP -> fy NP
19. ADJP -> aa
20. ADJP -> apron
21. ADJP -> NP zd
22. PP -> prep NP ; P+NP
23. PP -> NP prep ; inversion
24. PP -> prepb ; object omitted
25. PP -> NP ; prep. omitted
26. POSTADJP -> VP zj
Examples of parse trees are shown in the following figure.</Paragraph>
</Section>
<Section position="4" start_page="10" end_page="10" type="sub_section">
<SectionTitle> 2.3 Features of Classical Chinese Grammar Rules </SectionTitle>
<Paragraph position="0"> As an aside, it is worthwhile to point out some peculiarities of the Classical Chinese grammar used in our work. Readers not interested in grammar modeling may skip this subsection. As mentioned before, the grammar of Classical Chinese is entirely different from that of English, so a few special features must be studied. Although these features cause many difficulties for the parser, we have developed programming techniques that handle them successfully.</Paragraph>
<Paragraph position="1"> From the rule set, the reader may notice that two special grammatical structures are very common in Classical Chinese: 1. Inversion: subject/verb inversion (rule 3), preposition/object inversion (rule 23).</Paragraph>
<Paragraph position="2"> 2. Omission: subject omitted (rule 2), preposition's object omitted (rule 24), preposition omitted (rule 25).</Paragraph>
<Paragraph position="3"> Perhaps the strangest feature is the structure of the PP. An English PP is always P+NP. In Classical Chinese, however, inversion and omission allow the PP to take up to four forms (rules 22 to 25); in the accompanying examples the preposition is in brackets, and [] indicates an omission. Another feature that must be pointed out here is the cycle. In our rule set, two rules form a cycle: rule 11 (VP -> NP) and rule 17 (NP -> VP). The cycle eases our modeling because Classical Chinese is lexically and syntactically very ambiguous: an NP can act as a VP (a main verb), while a VP can act as an NP (subject or object). These two features are exemplified in figure 3. There are actually more cycles in the rule set. Helpful as they are, the cycles cause great difficulty for the memory-based top-down parser. In practice, we developed a closure-based method to solve this problem, as shown in the following pseudo-code:</Paragraph>
<Paragraph position="5"> Another point is the use of preferences for ambiguity resolution. While the ambiguities in our rule set greatly ease the modeling of Classical Chinese grammar, they also cause the parser to make many implausible errors. We therefore apply some predefined preferences, such as 'an fy must be at the beginning of an NP' and 'a yq must be at the end of a VP'. This results in a significant increase in parsing accuracy.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="10" end_page="10" type="metho">
<SectionTitle> 3 Evaluations </SectionTitle>
<Paragraph position="0"> In our preliminary experiments, we constructed a treebank of 1000 manually parsed sentences (quite large for a Classical Chinese treebank), of which 100 sentences are selected as the test set using the cross-validation scheme, while the others serve as the learning set. The majority of these sentences are extracted from classics of pre-Tsin Classical Chinese such as Hanfeizi and Xunzi, because these texts contain fewer proper nouns and difficult words. That is the restriction we put on the selection of Classical Chinese texts. It must be pointed out here that, compared with other languages, Classical Chinese sentences are so short that the average length is only about 4-6 words.</Paragraph>
</Section>
</Paper>
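The pseudo-code for the closure-based method mentioned in Section 2.3 did not survive in this copy of the text. As an illustration only, and not a reconstruction of the authors' procedure, the following sketch shows one common way to handle unary cycles such as VP -> NP and NP -> VP: precompute, for every pair of symbols, the probability of the best chain of unary rules linking them. The function name `unary_closure`, the dictionary layout, and all probabilities in `EXAMPLE_RULES` are assumptions made up for the example.

```python
def unary_closure(unary_rules):
    """Return {(A, B): p}, the best probability of rewriting A into B
    through zero or more unary rules.

    unary_rules maps (lhs, rhs) -> rule probability. Because every
    probability is at most 1, going around a cycle never improves a
    chain, so a max-product Floyd-Warshall pass terminates with the
    optimum even when the rules contain cycles.
    """
    symbols = set()
    for lhs, rhs in unary_rules:
        symbols.update((lhs, rhs))

    # The empty chain rewrites every symbol into itself with probability 1.
    best = {(s, s): 1.0 for s in symbols}
    for (lhs, rhs), prob in unary_rules.items():
        if prob > best.get((lhs, rhs), 0.0):
            best[(lhs, rhs)] = prob

    # Try every symbol k as an intermediate step between i and j.
    for k in symbols:
        for i in symbols:
            p_ik = best.get((i, k), 0.0)
            if p_ik == 0.0:
                continue
            for j in symbols:
                candidate = p_ik * best.get((k, j), 0.0)
                if candidate > best.get((i, j), 0.0):
                    best[(i, j)] = candidate
    return best


# Unary rules taken from the rule table (rules 2, 5, 11, 13, 17).
# The probabilities are invented for illustration, not from the paper.
EXAMPLE_RULES = {
    ("S", "VP"): 0.2,    # rule 2:  S  -> VP
    ("VP", "vi"): 0.3,   # rule 5:  VP -> vi
    ("VP", "NP"): 0.1,   # rule 11: VP -> NP
    ("NP", "n"): 0.6,    # rule 13: NP -> n
    ("NP", "VP"): 0.05,  # rule 17: NP -> VP
}

if __name__ == "__main__":
    closure = unary_closure(EXAMPLE_RULES)
    # Best chain from S down to the tag n: S -> VP -> NP -> n.
    print(closure[("S", "n")])
```

With such a table a chart parser can apply a whole chain of unary rules in a single step instead of following the NP/VP cycle repeatedly, which is the general idea behind closure-based treatments of unary rules.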