<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1605">
  <Title>Improving Translation Quality of Rule-based Machine Translation</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Preliminary Experiments &amp; Results.
</SectionTitle>
    <Paragraph position="0"> To evaluate our approach, we should test it on a word that occurs frequently in normal text and has several meanings. According to word-usage statistics from the 100M-word British National Corpus, verb-to-be occurs more than three million times, and translating verb-to-be into Thai is quite difficult using only linguistic rules. Therefore, in our experiment, we test our approach on verb-to-be.</Paragraph>
    <Paragraph position="1"> A condition tests the value of a single attribute: either &amp;quot;A_c &lt;= th&amp;quot; or &amp;quot;A_c &gt;= th&amp;quot;, where A_c is a continuous variable and th is some value for A_c that occurs in the training data; or A_s is a set-valued attribute and v is a value that is an element of A_s. In fact, a condition can include negation. A set-valued attribute is an attribute whose value is a set of strings. The primitive tests on a set-valued attribute A_s are of the form &amp;quot;v ∈ A_s&amp;quot;.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> When constructing a rule, RIPPER efficiently finds the test that maximizes information gain for a set of examples S, making only a single pass over S for each attribute. All symbols v that appear as elements of attribute A_s in some training example are considered by RIPPER. Figure 3: The training module. In the experiment, we use 3,200 English sentences from the Japan Electronic Dictionary Research Institute (EDR). The EDR corpus is collected from news, novels and journals. Our linguists then manually assigned the suitable Thai meaning of verb-to-be in each sentence. For training and testing, we divided the data into two groups: 700 sentences for testing and the remainder for training. We use various sizes of the training data set and different amounts of context information.</Paragraph>
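The single-pass, gain-maximizing test selection described above can be sketched as follows. This is a minimal illustration under our own assumptions, not RIPPER's actual implementation: it scores every candidate test "v ∈ A_s" on a set-valued attribute by FOIL-style information gain, counting coverage in one pass over the examples.

```python
# Minimal sketch (not RIPPER's actual code) of choosing the primitive test
# "v in A_s" on a set-valued attribute that maximizes FOIL-style information
# gain, using a single pass over the examples.
import math
from collections import defaultdict

def best_set_valued_test(examples):
    """examples: list of (set_of_values, is_target_class) pairs.
    Returns (v, gain) for the test 'v in A_s' with the largest gain."""
    pos = sum(1 for _, y in examples if y)
    prior = pos / len(examples)              # P(target class) before any test
    counts = defaultdict(lambda: [0, 0])     # v -> [neg covered, pos covered]
    for values, y in examples:               # single pass over the data
        for v in values:
            counts[v][int(y)] += 1
    best_v, best_gain = None, 0.0
    for v, (n, p) in counts.items():
        if p == 0:
            continue                         # test covers no positive examples
        gain = p * (math.log2(p / (p + n)) - math.log2(prior))
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```

Counting `[neg, pos]` per candidate value in one sweep is what keeps the search linear in the number of examples, matching the efficiency claim above.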
      <Paragraph position="1"> Tables 2, 3 and 4 show the results from C4.5, C4.5rule and RIPPER, respectively. The column series represent the number of training sentences. The row headers show the type of context information: Pos+-n, Word+-n and P&amp;W+-n denote part-of-speech tags, words, and both part-of-speech tags and words, respectively, with window size n.</Paragraph>
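The three context encodings compared in the tables can be illustrated with a small helper. This is a hypothetical function of our own, not code from the paper: for a target word at index i, it collects the words and/or part-of-speech tags within a window of +-n positions.

```python
# Hypothetical helper (not from the paper) illustrating the three context
# encodings Pos+-n, Word+-n and P&W+-n: part-of-speech tags, words, or both,
# within +-n positions around the target word at index i.
def context_features(words, tags, i, n, mode="P&W"):
    feats = {}
    for off in range(-n, n + 1):
        if off == 0:
            continue                              # skip the target word itself
        j = i + off
        word = words[j] if 0 <= j < len(words) else "<PAD>"
        tag = tags[j] if 0 <= j < len(tags) else "<PAD>"
        if mode in ("Word", "P&W"):               # Word+-n and P&W+-n settings
            feats[f"Word{off:+d}"] = word
        if mode in ("Pos", "P&W"):                # Pos+-n and P&W+-n settings
            feats[f"Pos{off:+d}"] = tag
    return feats
```

For example, `context_features(words, tags, i, 1, mode="Pos")` yields the Pos+-1 setting: only the tags immediately before and after the target, with a padding symbol at sentence boundaries.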
      <Paragraph position="2">  According to the results from C4.5 in Table 2, when the training data contains no more than 500 sentences, C4.5 achieves good accuracy using only part-of-speech tags at any window size. When the training data contains 1,000 sentences or more, considering only words gives the best accuracy, and the suitable window size depends on the size of the training data set. In Table 3, C4.5rule gives high accuracy when considering only part-of-speech tags at any window size. In Table 4, RIPPER produces high accuracy by investigating only one word and one part-of-speech tag before and after verb-to-be.</Paragraph>
      <Paragraph position="3"> Conclusion C4.5, C4.5rule and RIPPER are all effective at extracting context information from a training corpus. The accuracies of the three machine learning techniques do not differ greatly, and RIPPER gives better results than C4.5 and C4.5rule on a small training set. The appropriate context information depends on the machine learning algorithm: the context information giving the highest accuracy is +-3 words around the target word for C4.5, part-of-speech tags at any window size for C4.5rule, and +-1 word and part-of-speech tag for RIPPER. This shows that our approach can significantly improve translation quality. The advantages of our method are that 1) it is an adaptive model, 2) it can be applied to other languages, and 3) it does not require linguistic knowledge. In future experiments, we will include other machine learning techniques such as Winnow [1] and add other context information, such as semantic and grammatical features.</Paragraph>
    </Section>
  </Section>
</Paper>