<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2016"> <Title>Japanese Dependency Analysis using Cascaded Chunking</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Probabilistic Model </SectionTitle> <Paragraph position="0"> This section describes the general formulation of the probabilistic model for parsing which has been applied to Japanese statistical dependency analysis.</Paragraph> <Paragraph position="1"> First of all, we define a sentence as a sequence of segments B = hb1;b2 :::;bmi and its syntactic structure as a sequence of dependency patterns D = hDep(1);Dep(2);:::;Dep(m!1)i , where Dep(i) = j means that the segment bi depends on (modifies) segment bj. In this framework, we assume that the dependency sequence D satisfies the following two constraints.</Paragraph> <Paragraph position="2"> 1. Japanese is a head-final language. Thus, except for the rightmost one, each segment modifies exactly one segment among the segments appearing to its right.</Paragraph> <Paragraph position="3"> 2. Dependencies do not cross one another.</Paragraph> <Paragraph position="4"> Statistical dependency analysis is defined as a searching problem for the dependency pattern D that maximizes the conditional probability P(DjB) of the input sequence under the above-mentioned constraints. If we assume that the dependency probabilities are mutually independent, P(DjB) can be rewritten as:</Paragraph> <Paragraph position="6"> P(Dep(i)=jjfij) represents the probability that bi modifies bj. fij is an n dimensional feature vector that represents various kinds of linguistic features related to the segments bi and bj.</Paragraph> <Paragraph position="7"> We obtain Dbest = argmaxD P(DjB) taking into all the combination of these probabilities. Generally, the optimal solution Dbest can be identified by using bottom-up parsing algorithm such as CYK algorithm.</Paragraph> <Paragraph position="8"> The problem in the dependency structure analysis is how to estimate the dependency probabilities accurately. A number of statistical and machine learning approaches, such as Maximum Likelihood estimation (Fujio and Matsumoto, 1998), Decision Trees (Haruno et al., 1999), Maximum Entropy models (Uchimoto et al., 1999; Uchimoto et al., 2000; Kanayama et al., 2000), and Support Vector Machines (Kudo and Matsumoto, 2000), have been applied to estimate these probabilities.</Paragraph> <Paragraph position="9"> In order to apply a machine learning algorithm to dependency analysis, we have to prepare the positive and negative examples. Usually, in a probabilistic model, all possible pairs of segments that are in a dependency relation are used as positive examples, and two segments that appear in a sentence but are not in a dependency relation are used as negative examples. Thus, a total of n.(n!1)=2 training examples (where n is the number of segments in a sentence) must be produced per sentence.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Cascaded Chunking Model </SectionTitle> <Paragraph position="0"> In the probabilistic model, we have to estimate the probabilities of each dependency relation. However, some machine learning algorithms, such as SVMs, cannot estimate these probabilities directly. Kudo and Matsumoto (2000) used the sigmoid function to obtain pseudo probabilities in SVMs. However, there is no theoretical endorsement for this heuristics. 
Moreover, the probabilistic model does not scale well, since it usually requires a total of n(n-1)/2 training examples per sentence. This makes it hard to combine the probabilistic model with machine learning algorithms, such as SVMs, whose computational cost is polynomial in the number of training examples.</Paragraph>
<Paragraph position="1"> In this paper, we introduce a new method for Japanese dependency analysis which does not require the probabilities of dependencies and parses a sentence deterministically. The proposed method can be combined with any type of machine learning algorithm that has classification ability.</Paragraph>
<Paragraph position="2"> The original idea of our method stems from the cascaded chunking method which has been applied to English parsing (Abney, 1991). Let us introduce the basic framework of the cascaded chunking parsing method: 1. A sequence of base phrases is the input for this algorithm.</Paragraph>
<Paragraph position="3"> 2. Scanning from the beginning of the input sentence, chunk a series of base phrases into a single non-terminal node.</Paragraph>
<Paragraph position="4"> 3. For each chunked phrase, leave only the head phrase, and delete all the other phrases inside the chunk. 4. Finish the algorithm if a single non-terminal node remains; otherwise return to step 2 and repeat.</Paragraph>
<Paragraph position="5"> We apply this cascaded chunking parsing technique to Japanese dependency analysis. Since Japanese is a head-final language, and the chunking can be regarded as the creation of a dependency between two segments, we can simplify the process of Japanese dependency analysis as follows: 1. Put an O tag on all segments. The O tag indicates that the dependency relation of the current segment is undecided.</Paragraph>
<Paragraph position="6"> 2. For each segment with an O tag, decide whether it modifies the segment on its immediate right-hand side. If so, the O tag is replaced with a D tag.</Paragraph>
<Paragraph position="7"> 3. Delete all segments with a D tag that are immediately followed by a segment with an O tag. 4. Terminate the algorithm if a single segment remains; otherwise return to step 2 and repeat. Figure 1 shows an example of the parsing process with the cascaded chunking model (a code sketch of this procedure is also given below).</Paragraph>
<Paragraph position="8"> The input to the model is the set of linguistic features related to the modifier and modifiee, and the output from the model is one of the tags (D or O). In training, the model simulates the parsing algorithm by consulting the correct answer in the annotated training corpus. During training, positive (D) and negative (O) examples are collected. In testing, the model consults the trained classifier and parses the input with the cascaded chunking algorithm.</Paragraph>
<Paragraph position="9"> [Figure 1: Example of the parsing process with the cascaded chunking model]
We think this proposed cascaded chunking model has the following advantages compared with the traditional probabilistic models.</Paragraph>
<Paragraph position="10"> + Simple and Efficient If we use the CYK algorithm, the probabilistic model requires O(n^3) parsing time (where n is the number of segments in a sentence). On the other hand, the cascaded chunking model requires O(n^2) in the worst case, when all segments modify the rightmost segment.
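As a concrete illustration of steps 1-4 above (and of why the worst case is quadratic), the following Python sketch walks through the cascaded chunking procedure. It is only an outline under stated assumptions: classify(i, j) stands for a hypothetical binary classifier that answers whether segment i modifies segment j (its immediate right neighbour in the current pass), and the no-progress fallback at the end of each pass is our own addition to guarantee termination, not part of the paper's description.

```python
def cascaded_chunking_parse(n, classify):
    """Deterministic cascaded chunking over n segments (sketch of steps 1-4).

    classify(i, j): hypothetical binary classifier; True means segment i is
    judged to modify segment j, the segment on its immediate right in the
    current pass.  Returns heads, where heads[i] is the index of the segment
    that segment i modifies (-1 for the sentence-final segment).
    """
    heads = [-1] * n
    tags = ['O'] * n                     # step 1: every segment starts undecided
    active = list(range(n))              # segments still taking part in chunking
    while len(active) > 1:
        # step 2: classify each still-undecided segment against its right neighbour
        for k in range(len(active) - 1):
            i, j = active[k], active[k + 1]
            if tags[i] == 'O' and classify(i, j):
                tags[i], heads[i] = 'D', j
        # step 3: delete D-tagged segments immediately followed by an O-tagged one
        survivors = [idx for k, idx in enumerate(active)
                     if not (tags[idx] == 'D'
                             and k + 1 < len(active)
                             and tags[active[k + 1]] == 'O')]
        if len(survivors) == len(active):
            # fallback (our addition): if the classifier attaches nothing,
            # force the leftmost segment onto its neighbour so the loop ends
            heads[active[0]] = active[1]
            survivors = active[1:]
        active = survivors               # step 4: repeat until one segment remains
    return heads
```

With an oracle classifier derived from the gold annotations, the same loop also yields the training data: each classify call produces one positive (D) or negative (O) instance, which is why far fewer than n(n-1)/2 examples are generated per sentence.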
The actual parsing time is usually lower than O(n^2), since most segments modify the segment on their immediate right-hand side.</Paragraph>
<Paragraph position="11"> Furthermore, in the cascaded chunking model, the training examples are extracted using the parsing algorithm itself. The number of training examples required for the cascaded chunking model is much smaller than that for the probabilistic model. The model reduces the training cost significantly and enables training on larger annotated corpora.</Paragraph>
<Paragraph position="12"> + No assumption of independence between dependency relations The probabilistic model assumes that dependency relations are independent. However, there are some cases in which a sentence cannot be parsed correctly under this assumption. For example, coordinate structures cannot always be parsed with the independence constraint. The cascaded chunking model parses and estimates relations simultaneously. This means that one can use all dependency relations whose scope is narrower than that of the relation currently being considered as features. We describe the details in the next section.</Paragraph>
<Paragraph position="13"> + Independence from the machine learning algorithm The cascaded chunking model can be combined with any machine learning algorithm that works as a binary classifier, since it parses a sentence deterministically by only deciding whether or not the current segment modifies the segment on its immediate right-hand side. Probabilities of dependencies are not necessarily required by the cascaded chunking model.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Dynamic and Static Features </SectionTitle>
<Paragraph position="0"> Linguistic features that are supposed to be effective in Japanese dependency analysis are: head words and their part-of-speech tags, functional words and inflection forms of the words that appear at the end of segments, the distance between two segments, and the existence of punctuation marks. As these are defined solely by the pair of segments, we refer to them as static features.</Paragraph>
<Paragraph position="1"> Japanese dependency relations are heavily constrained by such static features, since inflection forms and postpositional particles constrain the dependency relation. However, when a sentence is long and there is more than one possible dependency, static features by themselves cannot determine the correct dependency.</Paragraph>
<Paragraph position="2"> To cope with this problem, Kudo and Matsumoto (2000) introduced a new type of features called dynamic features, which are created dynamically during the parsing process. For example, once some relation has been determined, this modification relation may influence other dependency relations.</Paragraph>
<Paragraph position="3"> Therefore, once a segment has been determined to modify another segment, such information is kept in both of the segments and is added to them as a new feature.
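One way to picture this bookkeeping is sketched below; the feature names and the per-segment dictionaries are purely hypothetical, invented for illustration rather than taken from the paper.

```python
# Hypothetical illustration of dynamic-feature bookkeeping during parsing.
def record_dependency(segments, modifier, modifiee):
    """Called whenever the parser fixes that `modifier` depends on `modifiee`.

    Each segment is assumed to carry a set of dynamic features ("dynamic")
    alongside its static ones; both names are invented for this sketch.
    """
    # the modifiee remembers something about its new child (e.g. its functional word)
    segments[modifiee]["dynamic"].add(("child-FW", segments[modifier]["FW"]))
    # the modifier remembers something about its chosen head (e.g. its head word's POS)
    segments[modifier]["dynamic"].add(("head-HW-POS", segments[modifiee]["HW-POS"]))
```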
Specifically, we take the following three types of dynamic features in our experiments.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Support Vector Machines </SectionTitle>
<Paragraph position="0"> Although any kind of machine learning algorithm can be applied to the cascaded chunking model, we use Support Vector Machines (Vapnik, 1998) for our experiments because of their state-of-the-art performance and generalization ability.</Paragraph>
<Paragraph position="1"> An SVM is a binary linear classifier trained from samples, each of which belongs to either the positive or the negative class: (x_1, y_1), ..., (x_l, y_l), with x_i ∈ R^n and y_i ∈ {+1, -1}, where x_i is the feature vector of the i-th sample, represented by an n-dimensional vector, and y_i is the class label (positive (+1) or negative (-1)) of the i-th sample. SVMs find the optimal separating hyperplane w · x + b = 0 based on the maximal margin strategy. The margin can be seen as the distance between the critical examples and the separating hyperplane. Omitting the details here, the maximal margin strategy can be realized by the following optimization problem:</Paragraph>
<Paragraph position="2"> Minimize: L(w) = (1/2) ||w||^2, subject to: y_i (w · x_i + b) >= 1 (i = 1, ..., l).</Paragraph>
<Paragraph position="3"> Furthermore, SVMs have the potential to carry out non-linear classification. Though we leave the details to (Vapnik, 1998), the optimization problem can be rewritten in a dual form, where all feature vectors appear only in dot products. By simply substituting every dot product of x_i and x_j in the dual form with a kernel function K(x_i, x_j), SVMs can handle non-linear hypotheses. Among the many kernel functions available, we focus on the d-th degree polynomial kernel: K(x_i, x_j) = (x_i · x_j + 1)^d.</Paragraph>
<Paragraph position="4"> Use of d-th degree polynomial kernel functions allows us to build an optimal separating hyperplane which takes into account all combinations of features up to d.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments and Discussion </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Experimental Setting </SectionTitle>
<Paragraph position="0"> We used the following two annotated corpora for our experiments.</Paragraph>
<Paragraph position="1"> + Standard data set This data set consists of the Kyoto University text corpus Version 2.0 (Kurohashi and Nagao, 1997). We used the 7,958 sentences from the articles of January 1st through January 7th as training examples, and the 1,246 sentences from the articles of January 9th as the test data. This data set was used in (Uchimoto et al., 1999; Uchimoto et al., 2000) and (Kudo and Matsumoto, 2000).</Paragraph>
<Paragraph position="2"> + Large data set In order to investigate the scalability of the cascaded chunking model, we prepared a larger data set. We used all 38,383 sentences of the Kyoto University text corpus Version 3.0. The training and test data were generated by two-fold cross-validation.</Paragraph>
<Paragraph position="3"> The feature sets used in our experiments are shown in Table 1. The static features are basically taken from Uchimoto's list (Uchimoto et al., 1999). The Head Word (HW) is the rightmost content word in the segment.
The Functional Word (FW) is set as follows:
- FW = the rightmost functional word, if there is a functional word in the segment
- FW = the rightmost inflection form, if there is a predicate in the segment
- FW = same as the HW, otherwise.</Paragraph>
<Paragraph position="4"> The static features include information on the existence of brackets, question marks and punctuation marks, etc. In addition, there are features that describe the relative relation of the two segments, such as the distance between them and the existence of brackets, quotation marks and punctuation marks between them.</Paragraph>
<Paragraph position="5"> For a segment X and its dynamic feature Y (where Y is of type A or B), we set the Functional Representation (FR) feature of X based on the FW of X (X-FW) as follows:
- FR = the lexical form of X-FW, if the POS of X-FW is particle, adverb, adnominal or conjunction
- FR = the inflectional form of X-FW, if X-FW has an inflectional form
- FR = the POS tag of X-FW, otherwise.</Paragraph>
<Paragraph position="6"> For a segment X and its dynamic feature C, we set the POS tag and POS subcategory of the HW of X. All our experiments were carried out on an AlphaServer 8400 (21164A, 500 MHz) for training and a Linux machine (Pentium III, 1 GHz) for testing. We used a third-degree polynomial kernel function, which is exactly the same setting as in (Kudo and Matsumoto, 2000).</Paragraph>
<Paragraph position="7"> Performance on the test data is measured using dependency accuracy and sentence accuracy. Dependency accuracy is the percentage of correct dependencies out of all dependency relations. Sentence accuracy is the percentage of sentences in which all dependencies are determined correctly.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Experimental Results </SectionTitle>
<Paragraph position="0"> The results for the new cascaded chunking model as well as for the previous probabilistic model based on SVMs (Kudo and Matsumoto, 2000) are summarized in Table 2. We could not carry out the experiments for the probabilistic model on the large data set, since the data size is too large for our current SVM learning program to terminate in a realistic time period.</Paragraph>
<Paragraph position="1"> Even though the number of training examples used for the cascaded chunking model is less than a quarter of that for the probabilistic model, and the feature set used is the same, dependency accuracy and sentence accuracy are improved by the cascaded chunking model (89.09% → 89.29%, 46.17% → 47.53%).</Paragraph>
<Paragraph position="2"> The time required for training and parsing is significantly reduced by applying the cascaded chunking model (336 h → 8 h, 2.1 sec → 0.5 sec).</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Probabilistic model vs. Cascaded Chunking model </SectionTitle>
<Paragraph position="0"> As can be seen in Table 2, the cascaded chunking model is more accurate, efficient and scalable than the probabilistic model.
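For reference, the two evaluation measures defined in Section 5.1 can be computed as in the following sketch (a hypothetical helper, not from the paper; it assumes gold and predicted head indices per sentence, with the head-less sentence-final segment already excluded).

```python
def evaluate(gold_heads, pred_heads):
    """Dependency accuracy and sentence accuracy as defined in Section 5.1.

    gold_heads / pred_heads: one list of head indices per sentence, with the
    head-less sentence-final segment already excluded from both.
    """
    correct_deps = total_deps = correct_sents = 0
    for gold, pred in zip(gold_heads, pred_heads):
        matches = sum(g == p for g, p in zip(gold, pred))
        correct_deps += matches
        total_deps += len(gold)
        correct_sents += int(matches == len(gold))
    dep_acc = 100.0 * correct_deps / total_deps
    sent_acc = 100.0 * correct_sents / len(gold_heads)
    return dep_acc, sent_acc

# e.g. a single 3-dependency sentence with one wrong head:
# evaluate([[1, 3, 3]], [[1, 2, 3]]) -> (66.7, 0.0) approximately
```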
It is difficult to apply the probabilistic model to the large data set, since it takes no less than 336 hours (2 weeks) to carry out the experiments even on the standard data set, and SVMs require a computational cost that is quadratic or worse in the number of training examples.</Paragraph>
<Paragraph position="1"> At first glance, it may seem natural that higher accuracy would be achieved with the probabilistic model, since all candidate dependency relations are used as training examples. However, the experimental results show that the cascaded chunking model performs better. Here we examine the most significant contributions and how the cascaded chunking model behaves compared with the probabilistic model.</Paragraph>
<Paragraph position="2"> The probabilistic model is trained with all candidate pairs of segments in the training corpus. The problem with this training is that exceptional dependency relations may be used as training examples.</Paragraph>
<Paragraph position="3"> For example, suppose a segment appears to the right-hand side of the correct modifiee and has a similar content word; the pair containing this segment becomes a negative example. However, it is negative only because there is a better, correct candidate at a different position in the sentence. Therefore, it may not be a true negative example, meaning that the same pair could be positive in another sentence. In addition, if a segment is not modified by a modifier because of the non-crossing constraint but has a content word similar to that of the correct modifiee, this relation also becomes an exception. We cannot ignore these exceptions, since most segments modify a segment on their immediate right-hand side. By using all candidate dependency relations as training examples, we commit to a number of exceptions which are hard to learn from. Considering in particular a powerful heuristic for dependency structure analysis, &quot;a segment tends to modify a nearer segment if possible,&quot; it is most important to learn whether the current segment modifies the segment on its immediate right-hand side. The cascaded chunking model is designed along this heuristic and can exclude the exceptional relations, which have little potential to improve performance.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Effects of Dynamic Features </SectionTitle>
<Paragraph position="0"> Figure 3 shows the relationship between the size of the training data and the parsing accuracy. This figure also shows the accuracy with and without the dynamic features. Generally, the results with the dynamic feature set are better than the results without it. The dynamic features consistently outperform the static features alone when the size of the training data is large. In most cases, the improvement is considerable.</Paragraph>
<Paragraph position="1"> Table 3 summarizes the performance without some dynamic features. From these results, we can</Paragraph> </Section> </Section> </Paper>