<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1206"> <Title>Enhancement of a Chinese Discourse Marker Tagger with C4.5</Title> <Section position="3" start_page="0" end_page="38" type="metho"> <SectionTitle> 2 Manual Tagging Process </SectionTitle> <Paragraph position="0"> To tag the discourse markers, the following coding scheme is designed to encode Real Discourse Markers (RDMs) appearing in the SIFAS corpus (T'sou et al. 1998). We describe the i-th discourse marker with a 7-tuple $RDM_i$: $RDM_i = \langle DM_i, RR_i, RP_i, CT_i, MN_i, RN_i, OT_i \rangle$, where</Paragraph> <Paragraph position="2"> $DM_i$: the lexical item of the Discourse Marker, or the value 'NULL'.</Paragraph> <Paragraph position="3"> $RR_i$: the Rhetorical Relation in which $DM_i$ is a constituent marker.</Paragraph> <Paragraph position="4"> $RP_i$: the Relative Position of $DM_i$. $CT_i$: the Connection Type of $RR_i$. $MN_i$: the Discourse Marker Sequence Number.</Paragraph> <Paragraph position="5"> $RN_i$: the Rhetorical Relation Sequence Number.</Paragraph> <Paragraph position="6"> $OT_i$: the Order Type of $RR_i$. The value of $OT_i$ can be 1, -1 or 0, denoting respectively the normal order, the reverse order, or the irrelevance of the premise-consequence ordering of $RR_i$. For apparent discourse markers that do not function as real discourse markers in a text, a different coding scheme is used. We describe the i-th apparent discourse marker with a 3-tuple $ADM_i$: $ADM_i = \langle LI_i, *, SN_i \rangle$, where $LI_i$: the Lexical Item of the apparent discourse marker.</Paragraph> <Paragraph position="7"> $SN_i$: the Sequence Number of the apparent discourse marker.</Paragraph> <Paragraph position="8"> In Chinese, discourse markers can be either words or phrases. To tag the SIFAS corpus, all discourse markers are organized into a discourse marker pair-rhetorical relation correspondence table. 
Part of the table is shown in Table 1.</Paragraph> <Paragraph position="9"> To construct an automatic tagging system, let us first examine the sequential steps in the tagging process of a human tagger.</Paragraph> <Paragraph position="10"> S1. Written Chinese consists of running text without word delimiters; the first step is to segment the text into Chinese word sequences.</Paragraph> <Paragraph position="11"> S2. On the basis of a discourse marker list, we identify those words in the text which appear on the list as Candidate Discourse Markers (CDMs).</Paragraph> <Paragraph position="12"> S3. We separate the CDMs into Real Discourse Markers (RDMs) and Apparent Discourse Markers (ADMs), and encode the ADMs with a 3-tuple.</Paragraph> <Paragraph position="13"> S4. We encode each RDM with a 7-tuple according to a Discourse Marker Pair-</Paragraph> </Section> <Section position="4" start_page="38" end_page="39" type="metho"> <SectionTitle> 3 Automatic Tagging Process </SectionTitle> <Paragraph position="0"> The identification of candidate discourse markers is based on a discourse marker list, which now contains 306 discourse markers plus a NULL marker. The markers are extracted from newspaper editorials from Hong Kong, Mainland China, Taiwan and Singapore. These markers constitute 480 distinct discontinuous pairs that correspond to 25 rhetorical relations. In actual usage, some discourse marker pairs designate different rhetorical relations depending on context, and some pairs can represent both INTER-sentence and INTRA-sentence relations. Thus the correspondence between the discourse marker pairs and the rhetorical relations is not single-valued: some pairs correspond to more than one rhetorical relation or connection type. In total, we have 504 correspondences between the discourse marker pairs and the rhetorical relations.</Paragraph> <Paragraph position="1"> In practice, one discontinuous constituent member of a marker pair is often omitted. 
We use the NULL marker to indicate the omission. Of the 504 correspondences, 244 are double-constituent marker pairs and 260 are single-constituent markers (i.e., one of the markers is NULL). Among the 244 double-constituent pairs, only 3 are not single-valued correspondences (one of which is an INTER/INTRA relation and can easily be distinguished). Thus the tagging of the 244 double-constituent pairs is basically a table searching process. For the 260 single-constituent markers, however, the identity of the NULL marker is often difficult to determine. The SIFAS tagging system works in two modes: automatic and interactive (semi-automatic). The automatic tagging procedure is as follows: 1. Data preparation: Input data files are modified according to the required format.</Paragraph> <Paragraph position="2"> 2. Word segmentation: Because there are no delimiters between Chinese words in a text, words have to be extracted through a segmentation process.</Paragraph> <Paragraph position="3"> 3. CDM identification 4. Full-marker RDM recognition 5. ADM identification (first pass, deterministic) 6. CDM feature extraction 7. ADM identification (second pass, via ML) 8. Tagging NULL-marker CDM pairs (via ML) 9. ADM and RDM sequencing, proofreading, training data generation, and statistics. The following principles are adopted by the tagging algorithm to resolve ambiguity in the process of matching discontinuous discourse markers: 1. the principle of greediness: When matching a pair of discourse markers for a rhetorical relation, priority is given to the first matched relation from the left. 
2. the principle of locality: When matching a pair of discourse markers for a rhetorical relation, priority is given to the relation where the distance between its constituent markers is shortest.</Paragraph> <Paragraph position="4"> 3. the principle of explicitness: When matching a pair of discourse markers for a rhetorical relation, priority is given to the relation where both markers are explicitly present.</Paragraph> <Paragraph position="5"> 4. the principle of superiority: When matching a pair of discourse markers for a rhetorical relation, priority is given to the inter-sentence relation whose back discourse marker matches the first word of a sentence.</Paragraph> <Paragraph position="6"> 5. the principle of back-marker preference: This is applicable only to rhetorical relations where either the front or the back marker is absent, or to a NULL marker. In such cases, priority is given to the relation with the back marker present.</Paragraph> <Paragraph position="7"> Steps 1 to 6 and the five principles underlie the original naive tagger of the SIFAS system (T'sou et al. 1998), which also contains the system framework.</Paragraph> </Section> <Section position="5" start_page="39" end_page="42" type="metho"> <SectionTitle> 4 Improvement </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="39" end_page="40" type="sub_section"> <SectionTitle> 4.1 Problems </SectionTitle> <Paragraph position="0"> Many Chinese discourse markers have both discourse senses and alternate sentential senses in different contexts. For a human tagger, steps S3 and S4 in Section 2 are not difficult because he/she can identify an ADM/RDM based on his/her text comprehension. However, for an automatic process, it is quite difficult to distinguish an ADM from an RDM if no syntactic/semantic information is available.</Paragraph> <Paragraph position="1"> Another problem is the location of the NULL marker described above. 
Our earlier statistics showed some characteristics in the distance measured by punctuation marks.</Paragraph> <Paragraph position="2"> Statistics from 80 tagged editorials show that most of the relations are INTRA-sentence relations (about 93%), and that about 70% of the INTRA RDM pairs have NULL markers.</Paragraph> <Paragraph position="3"> Most of these RDM pairs are separated by ONE comma (62%). These statistics show the importance of the problem of positioning the NULL markers.</Paragraph> <Paragraph position="4"> The naive tagger partially solved the CDM discrimination and NULL marker location problems. Our experiment shows that about 45% of the ADMs can be correctly identified, and about 60% of the NULL markers can be correctly located one comma/period away from the current RDM.</Paragraph> <Paragraph position="5"> This leaves much room for improvement.</Paragraph> <Paragraph position="6"> One solution is to add a few rules according to previous statistics. The original naive tagger did not assume any knowledge of the statistics and behavioral patterns of discourse markers. From the error analysis, we extracted some additional rules to guide the classification and matching of the discourse markers. For example, one of the rules we extracted is: &quot;A matching pair must be separated by at least two words or by punctuation marks&quot;. Using this rule, the following full marker matching error is avoided.</Paragraph> <Paragraph position="8"> Another solution is to use syntactic/semantic information through machine learning.</Paragraph> </Section> <Section position="2" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 4.2 C4.5 </SectionTitle> <Paragraph position="0"> Most empirical learning systems are given a set of pre-classified cases, each described by a vector of attribute values, and construct from them a mapping from attribute values to classes. C4.5 is one such system that learns decision-tree classifiers. 
It uses a divide-and-conquer approach to growing decision trees. The current version of C4.5 is C5.0 for Unix and See5 for Windows.</Paragraph> <Paragraph position="1"> Let attributes be denoted $A=\{a_1, a_2, \ldots, a_m\}$, cases be denoted $D=\{d_1, d_2, \ldots, d_n\}$, and classes be denoted $C=\{c_1, c_2, \ldots, c_k\}$. For a set of cases $D$, a test $T$ is a split of $D$ based on an attribute $a_i$. It splits $D$ into mutually exclusive subsets $D_1, D_2, \ldots, D_r$. Ideally, these subsets of cases are single-class collections of cases.</Paragraph> <Paragraph position="2"> If a test $T$ is chosen, the decision tree for $D$ consists of a node identifying the test $T$, and one branch for each possible subset $D_i$. For each subset $D_i$, a new test is then chosen for further splitting. If $D_i$ satisfies a stopping criterion, the tree for $D_i$ is a leaf associated with the most frequent class in $D_i$. One reason for stopping is that all cases in $D_i$ belong to one class.</Paragraph> <Paragraph position="3"> C4.5 uses $\arg\max_T gain(D,T)$ or $\arg\max_T gain\ ratio(D,T)$ to choose tests for splitting. In the standard C4.5 formulation: $$gain(D,T) = Info(D) - \sum_{j=1}^{r} \frac{|D_j|}{|D|} Info(D_j), \qquad gain\ ratio(D,T) = \frac{gain(D,T)}{Split(D,T)},$$ $$Info(D) = -\sum_{i=1}^{k} p(c_i,D) \log_2 p(c_i,D), \qquad Split(D,T) = -\sum_{j=1}^{r} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|},$$</Paragraph> <Paragraph position="5"> where $p(c_i,D)$ denotes the proportion of cases in $D$ that belong to the i-th class.</Paragraph> </Section> <Section position="3" start_page="40" end_page="41" type="sub_section"> <SectionTitle> 4.3 Application of C4.5 </SectionTitle> <Paragraph position="0"> Since using semantic information requires a comprehensive thesaurus, which is unavailable at present, we only use syntactic information through machine learning.</Paragraph> <Paragraph position="1"> The attributes used in the original SIFAS system include the candidate discourse marker itself, the two words immediately to the left of the CDM, and the two words immediately to the right of the CDM.</Paragraph> <Paragraph position="2"> In left-to-right order, the attribute names are F2, F1, CDM, B1, B2 (T'sou et al. 1999). SIFAS only uses the part-of-speech attribute of the neighboring words. 
This reflects to some degree the syntactic characteristics of the CDM.</Paragraph> <Paragraph position="3"> To reflect the distance characteristics, we add two other attributes: the number of discourse delimiters (commas and semicolons for INTRA-sentence relations, periods and exclamation marks for INTER-sentence relations) before and after the current CDM, denoted Fcom and Bcom, respectively. For the location of the NULL marker, we further add the actual number of delimiters, Acom. The order of these attributes is: CDM, F1, F2, B1, B2, Fcom, Bcom, Acom for NULL marker location, and CDM, F1, F2, B1, B2, Fcom, Bcom, IsRDM for CDM classification, where IsRDM is a Boolean value.</Paragraph> <Paragraph position="4"> The following are two examples of cases: 9~: _N. ,?,q,a,a,7,1,1 for NULL marker</Paragraph> <Paragraph position="6"> where &quot;?&quot; denotes that no corresponding word is at the position (beginning or end of sentence); a, d, q, and u are part-of-speech symbols in our segmentation dictionary, representing adjective, adverb, classifier, and auxiliary, respectively.</Paragraph> <Paragraph position="7"> The following are two examples of the rules generated by C4.5. The first is a CDM classification rule, and the other is a NULL marker location rule, which can be explained as: if the second word after the RDM is a preposition, and there is more than one comma before the current RDM, then the location of the NULL marker is two commas away from the RDM.</Paragraph> </Section> <Section position="4" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 4.4 Objects in the SIFAS system </SectionTitle> <Paragraph position="0"> The objects in the new SIFAS tagging system are listed below.</Paragraph> <Paragraph position="1"> 1. Dictionary Editor: for updating the word segmentation dictionary and the rhetorical relation table.</Paragraph> <Paragraph position="2"> 2. Data Manager: for modifying the input data (editorial texts) to conform to the required format. 3. 
Word Segmenter: for the segmentation of the original texts, and the recognition of CDMs.</Paragraph> <Paragraph position="3"> 4. RDM Tagger: The initial identification of RDMs is a table searching process. All full-marker pairs are identified as rhetorical relations according to the principles described above. For NULL-marker pairs, the location of the NULL marker is left to the Rule Interpreter. 5. ADM Tagger: The identification of ADMs is also a table searching process because, without other syntactic/semantic information, the only way to identify an ADM among the CDMs is to establish that the CDM cannot form a valid pair with any other CDM (including the NULL marker) that corresponds to a rhetorical relation. 6. CDM Feature Extractor: For the remaining untagged CDMs, the classification is carried out through C4.5. The Feature Extractor extracts syntactic information about the current CDM and sends it to the Rule Interpreter (see below).</Paragraph> <Paragraph position="4"> 7. Rule Interpreter: C4.5 takes the feature data file as input to construct a classifier, and the rules formed are stored in an output file. The Rule Interpreter reads this output file and applies the rules to classify the CDMs. In our system, the Rule Interpreter functions as a NULL Marker Locator and a CDM Classifier. 8. Sequencer: for the rearrangement of RDM and ADM order numbers. In the rearranging process, the Sequencer also extracts statistical information for analysis.</Paragraph> <Paragraph position="5"> 9. Interaction Recorder: for recording user interaction information for statistical use.</Paragraph> <Paragraph position="6"> 10. Data Retriever: for data retrieval and browsing.</Paragraph> </Section> </Section> </Paper>