<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1605"> <Title>Improving Translation Quality of Rule-based Machine Translation</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Applying Machine Learning Techniques </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Context Information </SectionTitle> <Paragraph position="0"> There are many kinds of context information useful for deciding the appropriate meaning of a word, such as grammatical rules, collocations, context words, and semantic concepts.</Paragraph> <Paragraph position="1"> Context information is derived from a rule-based machine translation system. Words and their part-of-speech tags are the simplest information; they are produced by the English analysis module. In this paper, we use words and/or part-of-speech tags around a target word to decide the word's meaning.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Machine Learning </SectionTitle> <Paragraph position="0"> In this section, we briefly describe three machine learning techniques: C4.5, C4.5rule, and RIPPER.</Paragraph> <Paragraph position="1"> C4.5, a decision-tree learner, is a traditional classification technique proposed by Quinlan [7]. C4.5 has been successfully applied to many NLP problems, such as word extraction [9] and sentence boundary disambiguation [2], so we employ it in our experiments. The induction algorithm proceeds by evaluating a series of attributes and iteratively building a tree from the attribute values, with the leaves of the decision tree holding the values of the goal attribute. At each step of the learning procedure, the evolving tree is branched on the attribute that partitions the data items with the highest information gain. Branches are added until all items in the training set are classified. 
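The information-gain branching step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the toy context features ("prev", "pos") and sense labels are invented for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from partitioning the data on one attribute."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# Toy word-sense data: context features for the ambiguous word "bank"
rows = [
    {"prev": "river", "pos": "NN"},
    {"prev": "money", "pos": "NN"},
    {"prev": "river", "pos": "NN"},
    {"prev": "money", "pos": "VB"},
]
labels = ["shore", "finance", "shore", "finance"]

# The tree branches on the attribute with the highest information gain;
# here "prev" separates the two senses perfectly.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

At each node the same selection is repeated on the remaining data until every training item is classified, as the paragraph above describes.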
To reduce the effect of overfitting, C4.5 prunes the entire decision tree after it is constructed. It recursively examines each subtree to determine whether replacing it with a leaf or a branch would reduce the expected error rate. This pruning makes the decision tree better at handling data that differ from the training data.</Paragraph> <Paragraph position="2"> C4.5 version 8 provides a further technique, extended from C4.5, called C4.5rule. C4.5rule extracts production rules from an unpruned decision tree produced by C4.5, and then improves the ruleset by greedily deleting or adding single rules in an effort to reduce the description length. In this paper we therefore employ both C4.5 and C4.5rule.</Paragraph> <Paragraph position="3"> RIPPER [10] is one of the best-known machine learning techniques applied to NLP problems [4]; it was proposed by William W. Cohen.</Paragraph> <Paragraph position="4"> His experiments [10] show that RIPPER is more efficient than C4.5 on noisy data and that it scales nearly linearly with the number of examples in a dataset. We therefore chose RIPPER for evaluating and comparing results with C4.5 and C4.5rule.</Paragraph> <Paragraph position="5"> RIPPER is a propositional rule learning algorithm that constructs a ruleset which classifies the training data [11]. A rule in the constructed ruleset is represented in the form T1 ∧ T2 ∧ … ∧ Tk → C, where C is a target class to be learned; it can be a positive or a negative class. A condition Ti</Paragraph> <Paragraph position="7"> tests for a particular value of an attribute, and it takes one of four forms. We now describe the process of our system, shown in Figure 2. First, a source sentence is input into the rule-based MT system, which analyses the sentence using syntax and semantic rules. At this step, the rule-based MT system produces various kinds of word information; in this experiment we used only words and part-of-speech tags. After the analysis, the rule-based MT system generates a sentence in the target language. 
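The RIPPER-style rule representation above can be sketched as follows. This is an illustrative sketch, not the paper's system: each rule is a conjunction of attribute tests implying a class, and only equality tests are shown (RIPPER's other condition forms, such as threshold and set-membership tests, are omitted); the feature names and sense labels are invented:

```python
def matches(conditions, example):
    """A rule fires only when every condition Ti holds for the example."""
    return all(example.get(attr) == value for attr, value in conditions)

def classify(ruleset, example, default="negative"):
    """Apply rules in order; the first rule that fires assigns its class C."""
    for conditions, cls in ruleset:
        if matches(conditions, example):
            return cls
    return default

# Two toy rules for disambiguating the word "bank":
#   prev_word = river ∧ pos = NN  ->  bank/shore
#   prev_word = money             ->  bank/finance
ruleset = [
    ([("prev_word", "river"), ("pos", "NN")], "bank/shore"),
    ([("prev_word", "money")], "bank/finance"),
]

sense = classify(ruleset, {"prev_word": "river", "pos": "NN"})  # bank/shore
```

An example that matches no rule falls through to the default (negative) class, mirroring the positive/negative target classes mentioned above.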
Next, the translated sentence from the rule-based MT system and the context information are passed to the machine learning module. The machine learning module requires a ruleset or a decision tree, generated from a training set, to decide the appropriate meaning of a word.</Paragraph> <Paragraph position="8"> In the training module (Figure 3), we pass English sentences with part-of-speech tags, which are given by ParSit, together with the correct meanings assigned by linguists, into the machine learning module. The machine learning module learns and produces a ruleset or a decision tree for disambiguating word meanings. The process of training is shown in Figure 3.</Paragraph> </Section> </Section> </Paper>