<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1074">
  <Title>Applying Machine Learning to Chinese Temporal Relation Resolution</Title>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3 Machine Learning Approaches for Relative
Relation Resolution
</SectionTitle>
    <Paragraph position="0"> Previous efforts in corpus-based natural language processing have incorporated machine learning methods to coordinate multiple linguistic features for example in accent restoration (Yarowsky, 1994) and event classification (Siegel and McKeown, 1998), etc.</Paragraph>
    <Paragraph position="1"> Relative relation resolution can be modeled as a relation classification task. We model the thirteen relative temporal relations (see Figure 1) as the classes to be decided by a classifier. The resolution process is to assign an event pair (i.e. the two events under concern)  to one class according to their linguistic features. For this purpose, we train two classifiers, a Probabilistic Decision Tree Classifier (PDT) and a Naive Bayesian Classifier (NBC). We then combine the results by the Collaborative Bootstrapping (CB) technique which is used to mediate the sparse data problem arose due to the limited number of training cases.</Paragraph>
    <Paragraph position="2">  It is an object in machine learning algorithms.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Probabilistic Decision Tree (PDT)
</SectionTitle>
      <Paragraph position="0"> Due to two domain-specific characteristics, we encounter some difficulties in classification. (a) Unknown values are common, for many events are modified by less than three linguistic features. (b) Both training and testing data are noisy. For this reason, it is impossible to obtain a tree which can completely classify all training examples. To overcome this predicament, we aim to obtain more adjusted probability distributions of event pairs over their possible classes. Therefore, a probabilistic decision tree approach is preferred over conventional decision tree approaches (e.g. C4.5, ID3). We adopt a non-incremental supervised learning algorithm in TDIDT (Top Down Induction of Decision Trees) family. It constructs a tree top-down and the process is guided by distributional information learned from examples (Quinlan, 1993).</Paragraph>
      <Paragraph position="1">  Based on probabilities, each object in the PDT approach can belong to a number of classes. These probabilities could be estimated from training cases with Maximum Likelihood Estimation (MLE). Let l be the decision sequence, z the object and c the class. The probability of z belonging to c is:  Objects are classified into classes based on their attributes. In the context of temporal relation resolution, how to categorize linguistic features into classification attributes is a major design issue. We extract all temporal indicators surrounding an event. Assume m and n are the anterior and posterior window size. They represent the numbers of the indicators BEFORE and AFTER respectively. Consider the most extreme case where an event consists of at most 4 temporal indicators before and 2 after. We set m and n to 4 and 2 initially. Experiments show that learning performance drops when m&gt;4 and n&gt;2 and there is only very little difference otherwise (i.e. when m[?]4 and n[?]2).</Paragraph>
      <Paragraph position="2"> In addition to temporal indicators alone, the position of the punctuation mark separating the two clauses describing the events and the classes of the events are also useful classification attributes. We will outline why this is so in Section 4.1. Altogether, the following 15 attributes are used to train the PDT and NBC classifiers:</Paragraph>
      <Paragraph position="4"> (j=1,2) are the ith indictor before and the jth indicator after the event e k (k=1,2). Given a sentence, for example, Xian /TI_d You /E0 Liao /TI_u Ma Che /n ,/w Cai /TI_d Xiu /E2 Liao /TI_u Yi Dao /n . /w, the attribute vector could be represented as: [0, 0, 0, Xian , E0, Liao , 0, 1, 0, 0, 0, Cai , E2, Liao , 0].  Many similar attribute selection functions were used to construct a decision tree (Marquez, 2000). These included information gain and information gain ratio (Quinlan, 1993),  kh Test and Symmetrical Tau (Zhou and Dillon, 1991). We adopt the one proposed by Lopez de Mantaraz (Mantaras, 1991) for it shows more stable performance than Quinlan's information gain ratio in our experiments. Compared with Quinlan's information gain ratio, Lopez's distance-based measurement is unbiased towards the attributes with a large number of values and is capable of generating smaller trees with no loss of accuracy (Marquez, Padro and Rodriguez, 2000). This characteristic makes it an ideal choice for our work, where most attributes have more than 200 values.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Naive Bayesian Classifier (NBC)
</SectionTitle>
      <Paragraph position="0"> NBC assumes independence among features.</Paragraph>
      <Paragraph position="1"> Given the class label c, NBC learns from training data the conditional probability of each attribute A</Paragraph>
      <Paragraph position="3"> (see Section 3.1.2). Classification is then performed by applying Bayes rule to compute the probability of c given the particular instance of A</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> are estimated by MLE from training data with Dirichlet Smoothing method:</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Collaborative Bootstrapping (CB)
</SectionTitle>
      <Paragraph position="0"> PDT and NB are both supervised learning approach. Thus, the training processes require many labeled cases. Recent results (Blum and Mitchell, 1998; Collins, 1999) have suggested that unlabeled data could also be used effectively to reduce the amount of labeled data by taking advantage of collaborative bootstrapping (CB) techniques. In previous works, CB trained two homogeneous classifiers based on different independent feature spaces. However, this approach is not applicable to our work since only a few temporal indicators occur in each case. Therefore, we develop an alternative CB algorithm, i.e. to train two different classifiers based on the same feature spaces. PDT (a non-linear classifier) and NBC (a linear classifier) are under consideration.</Paragraph>
      <Paragraph position="1"> This is inspired by Blum and Mitchell's theory that two collaborative classifiers should be conditionally independent so that each classifier can make its own contribution (Blum and Mitchell, 1998). The learning steps are outlined in Figure 2.</Paragraph>
      <Paragraph position="2"> Inputs: A collection of the labeled cases and unlabeled cases is prepared. The labeled cases are separated into three parts, training cases, test cases and held-out cases.</Paragraph>
      <Paragraph position="3"> Loop: While the breaking criteria is not satisfied  beled cases, and exchange with the selected cases which have higher Classification Confidence (i.e. the uncertainty is less than a threshold).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Evaluate the PDT and NBC classifiers
</SectionTitle>
    <Paragraph position="0"> with the held-out cases. If the error rate increases or its reduction is below a threshold break the loop; else go to step</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 Classification Confidence Measurement
</SectionTitle>
      <Paragraph position="0"> Classification confidence is the metric used to measure the correctness of each labeled case automatically (see Step 2 in Figure 2). The desirable metric should satisfy two principles: * It should be able to measure the uncertainty/ certainty of the output of the classifiers; and * It should be easy to calculate.</Paragraph>
      <Paragraph position="1"> We adopt entropy, i.e. an information theory based criterion, for this purpose. Let x be the classified object, and },...,,,{  is known, the entropy can be determined. These parameters can be easily determined in PDT, as each incoming case is classified into each class with a probability. However, the incoming cases in NBC are grouped into one class which is assigned the highest score. We then have to estimate</Paragraph>
      <Paragraph position="3"> from those scores. Without loss of generality, the probability is estimated as:</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Experiment Setup and Evaluation
</SectionTitle>
    <Paragraph position="0"> Several experiments have been designed to evaluate the proposed learning approaches and to reveal the impact of linguistic features on learning performance. 700 sentences are extracted from Ta Kong Pao (a local Hong Kong Chinese newspaper) financial version. 600 cases are labeled manually and 100 left unlabeled. Among those labeled, 400 are used as training data, 100 as test data and the rest as held-out data.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Use of Linguistic Features As Classification
Attributes
</SectionTitle>
      <Paragraph position="0"> The impact of a temporal indicator is determined by its position in a sentence. In PDT and NBC, we consider an indicator located in four positions: (1) BEFORE the first event; (2) AFTER the first event and BEFORE the second and it modifies the first event; (3) the same as (2) but it modifies the second event; and (4) AFTER the second event. Cases (2) and (3) are ambiguous. The positions of the temporal indicators are the same. But it is uncertain whether these indicators modify the first or the second event if there is no punctuation separating their roles. We introduce two methods, namely NA and SAP to check if the ambiguity affects the two learning approaches. null N(atural) O(rder): the temporal indicators between the two events are extracted and compared according to their occurrence in the sentences regardless which event they modify.</Paragraph>
      <Paragraph position="1"> S(eparate) A(uxiliary) and P(osition) words: we try to resolve the above ambiguity with the grammatical features of the indicators. In this method, we assume that an indicator modifies the first event if it is an auxiliary word (e.g. Liao ), a trend verb (e.g. Qi Lai ) or a position word (e.g. Qian ); otherwise it modifies the second event.</Paragraph>
      <Paragraph position="2"> Temporal indicators are either tense/aspect or connectives (see Section 2.2). Intuitively, it seems that classification could be better achieved if connective features are isolated from tense/ aspect features, allowing like to be compared with like. Methods SC1 and SC2 are designed based on this assumption.</Paragraph>
      <Paragraph position="3"> Table 2 shows the effect the different classification methods.</Paragraph>
      <Paragraph position="4"> SC1 (Separate Connecting words 1): it separates conjunctions and verbs relating to causality from others. They are assumed to contribute to discourse structure (intra- or inter-sentence structure), and the others contribute to the tense/aspect expressions for each individual event. They are built into 2 separate attributes, one for each event.</Paragraph>
      <Paragraph position="5"> SC2 (Separate Connecting words 2): it is the same as SC1 except that it combines the connecting word pairs (i.e. as a single pattern) into one attribute. null EC (Event Class): it takes event classes into consideration. null</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Impact of Individual Features
</SectionTitle>
      <Paragraph position="0"> From linguistic perspectives, 13 features (see Table 1) are useful for relative relation resolution. To examine the impact of each individual feature, we feed a single linguistic feature to the PDT learning algorithm one at a time and study the accuracy of the resultant classifier. The experimental results are given in Table 3. It shows that event classes have greatest accuracy, followed by conjunctions in the second place, and adverbs in the third.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.3 Discussions
</SectionTitle>
      <Paragraph position="0"> Analysis of the results in Tables 2 and 3 reveals some linguistic insights: 1. In a situation where temporal indicators appear between two events and there is no punctuation mark separating them, POS information help reduce the ambiguity. Compared with NO, SAP shows a slight improvement from 82% to 82.2%.</Paragraph>
      <Paragraph position="1"> But the improvement seems trivial and is not as good as our prediction. This might due to the small percent of such cases in the corpus.</Paragraph>
      <Paragraph position="2"> 2. Separating conjunctions and verbs relating to causality from others is ineffective. This reveals the complexity of Chinese in connecting expressions. It is because other words (such as adverbs, proposition and position words) also serve such a function. Meanwhile, experiments based on SC1 and SC2 suggest that the connecting expressions generally involve more than one word or phrase. Although the words in a connecting expression are separated in a sentence, the action is indeed interactive. It would be more useful to regard them as one attribute.</Paragraph>
      <Paragraph position="3"> 3. The effect of event classification is striking. Taking this feature into account, the accuracies of both PDT and NB improved significantly. As a matter of fact, different event classes may introduce different relations even if they are constrained by the same temporal indicators.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.4 Collaborative Bootstrapping
</SectionTitle>
      <Paragraph position="0"> Table 4 presents the evaluation results of the four different classification approaches. DM is the default model, which classifies all incoming cases as the most likely class. It is used as evaluation baseline.</Paragraph>
      <Paragraph position="1"> Compare with DM, PDT and NBC show improvement in accuracy (i.e. above 60% improvement).</Paragraph>
      <Paragraph position="2"> And CB in turn outperforms PDT and NBC. This proves that using unlabeled data to boost the performance of the two classifiers is effective.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> Relative temporal relation resolution received growing attentions in recent years. It is important for many natural language processing applications, such as information extraction and machine translation.</Paragraph>
    <Paragraph position="1"> This topic, however, has not been well studied, especially in Chinese. In this paper, we propose a model for relative temporal relation resolution in Chinese. Our model combines linguistic knowledge and machine learning approaches. Two learning approaches, namely probabilistic decision tree (PDT) and naive Bayesian classifier (NBC) and 13 linguistic features are employed. Due to the limited labeled cases, we also propose a collaborative bootstrapping technique to improve learning performance. The experimental results show that our approaches are encouraging. To our knowledge, this is the first attempt of collaborative bootstrapping, which involves two heterogeneous classifiers, in NLP application.</Paragraph>
    <Paragraph position="2"> This lays down the main contribution of our research.</Paragraph>
    <Paragraph position="3"> In this pilot work, temporal indicators are selected based on linguistic knowledge. It is time-consuming and could be error-prone. This suggests two directions for future studies. We will try to automate or at least semi-automate feature selection process. Another future work worth investigating is temporal indicator clustering. There are two methods we could investigate, i.e. clustering the recognized indicators which occur in training corpus according to co-occurrence information or grouping them into two semantic roles, one related to tense/aspect expressions and the other to connecting expressions between two events.</Paragraph>
  </Section>
class="xml-element"></Paper>