<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1039"> <Title>Hybrid Approaches to Improvement of Translation Quality in Web-based English-Korean Machine Translation</Title> <Section position="4" start_page="252" end_page="252" type="metho"> <SectionTitle> 3 Compound Unit Recognition </SectionTitle> <Paragraph position="0"> parsing mechanism. Partial parser operates on cyclic trie and simple CFG rules for the fast syntactic constraint check. The experimental result showed our syntactic verification increased the precision of CU recognition to 99.69%.</Paragraph> </Section> <Section position="5" start_page="252" end_page="252" type="metho"> <SectionTitle> 4 Competitive Learning Grammar </SectionTitle> <Paragraph position="0"> One of the problems of rule-based translation has been the idiomatic expression which has been dealt mainly with syntactic grammar rules (Katoh and Aizawa, 1995) &quot;Mary keeps up with her brilliant classmates.&quot; and &quot;I prevent him from going there.&quot; are simple examples of uninterupted and interupted idiomatic expressions expectively.</Paragraph> <Paragraph position="1"> In order to solve idiomatic expressions as well as collocations and frozen compound nouns, we have developed the compound unit(CU) recognizer (Jung et. al., 1997). It is a plug-in model locating between morphological and syntactic analyzer. Figure 2 shows the structure The recognizer searches all possible CUs in the input sentence using co-occurrence constraint string/POS and syntactic constraint and makes the CU index. Syntactic verifier checks the syntactic verification of variable constituents in CU. For syntactic verifier we use a partial For the parse tree ranking of too many ambiguities in English syntactic analysis, we use the mechanism to insert the competitive probabilistics into the rules. To decide the correct parse tree ranking, we compare two partial parse trees on the same node level with competitive relation and add ct (currently, 0.01) to the better one, but subtract ct from the worse one on the base of the intuition of linguists. This results now in raising the better parse tree higher in the ranking list of the parse trees than the worse one.</Paragraph> </Section> <Section position="6" start_page="252" end_page="253" type="metho"> <SectionTitle> 5 Robust Translation </SectionTitle> <Paragraph position="0"> In order to deal with long sentences, parsingfailed or ill-formed sentences, we activate the robust translation. It consists of two steps: first, long sentence segmentation and then fail softening.</Paragraph> <Section position="1" start_page="252" end_page="253" type="sub_section"> <SectionTitle> 5.1 Long Sentence Segmentation </SectionTitle> <Paragraph position="0"> The grammar rules have generally a weak point to cover long sentences. If there are no grammar rules to process a long sentence, the whole parse tree of a sentence can not be produced. Long sentence segmentation produces simple from long sentences before parsing fragements fails.</Paragraph> <Paragraph position="1"> We use the clue of the sentence POS sequence of input sentence as a segmentation. If the length of input exceeds pre-defined threshold, currently 21 for segmentation level I and 25 for level II, the sentence is divided into two or more parts. Each POS trigram is separately applied to the level 1 or II. After segmenting, each part of input sentence is analyzed and translated. 
<Paragraph position="2"> The following example shows an extremely long sentence (45 words) and its long sentence segmentation result. [Input sentence] &quot;Were we to assemble a Valkyrie to challenge IBM, we could play Deep Blue in as many games as IBM wanted us to in a single match, in fact, we could even play multiple games at the same time. Now -- wouldn't that be interesting?&quot;</Paragraph> </Section>
<Section position="2" start_page="253" end_page="253" type="sub_section"> <SectionTitle> [Long Sentence Segmentation] </SectionTitle>
<Paragraph position="0"> &quot;Were we to assemble a Valkyrie to challenge IBM, / (noun PUNCT pron) we could play Deep Blue in as many games as IBM wanted us to in a single match, / (noun PUNCT adv) in fact, / (noun PUNCT pron) we could even play multiple games at the same time, / (adv PUNCT adv) Now -- / (PUNCT PUNCT aux) wouldn't that be interesting?&quot;</Paragraph> </Section>
<Section position="3" start_page="253" end_page="253" type="sub_section"> <SectionTitle> 5.2 Fail Softening </SectionTitle>
<Paragraph position="0"> For robust translation we have a module, 'fail softening', that processes the failed parse trees when parsing fails. Fail softening finds a set of edges that covers the whole input sentence and builds a parse tree from them using a virtual sentence tag.</Paragraph>
<Paragraph position="1"> We use left-to-right and right-to-left scanning with a &quot;longer-edge-first&quot; policy. If one scanning direction yields no edge set covering the input sentence, the other direction is preferred. If both yield an edge set, a &quot;smaller-set-first&quot; policy is applied to select the preferred set; that is, the set with fewer edges wins (e.g. if n(LR)=6 and n(RL)=5, then the right-to-left set is selected as the first-ranked parse tree, where n(LR) and n(RL) are the numbers of edges found by left-to-right and right-to-left scanning respectively). We use a virtual sentence tag to connect the selected set of edges. One piece of future work is a mechanism that weights each edge by syntactic preference.</Paragraph> </Section> </Section>
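As an illustration of this selection policy, the following Python sketch applies the longer-edge-first and smaller-set-first rules to chart edges represented as (start, end) spans. The greedy covering loop and the span representation are our own assumptions for the sketch; the paper does not describe the actual chart traversal.

```python
# Illustrative sketch of fail softening's edge-set selection.
# An edge is a (start, end) span over the input; the chart comes from the
# failed parse. The greedy covering below is an assumption for illustration.

def cover(edges, length, right_to_left=False):
    """Greedily cover [0, length) with edges, longest-first from one side.
    Returns the chosen edge list, or None if no covering exists."""
    if right_to_left:
        # Mirror the spans, cover left-to-right, then mirror back.
        mirrored = cover([(length - e, length - s) for s, e in edges], length)
        return None if mirrored is None else \
            [(length - e, length - s) for s, e in reversed(mirrored)]
    chosen, pos = [], 0
    while pos < length:
        candidates = [e for e in edges if e[0] == pos]
        if not candidates:
            return None
        best = max(candidates, key=lambda e: e[1] - e[0])  # longer-edge-first
        chosen.append(best)
        pos = best[1]
    return chosen

def fail_soften(edges, length):
    """Pick the preferred covering: if only one scan succeeds, use it;
    if both succeed, the one with fewer edges wins (smaller-set-first)."""
    lr = cover(edges, length)
    rl = cover(edges, length, right_to_left=True)
    if lr is None:
        return rl
    if rl is None:
        return lr
    return rl if len(rl) < len(lr) else lr  # e.g. n(LR)=6, n(RL)=5 -> RL
```

The edges of the winning set would then be attached under the virtual sentence tag to form the output parse tree.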
<Section position="7" start_page="253" end_page="253" type="metho"> <SectionTitle> 6 Large Collocation Dictionary </SectionTitle>
<Paragraph position="0"> In the transfer phase we select the correct word equivalent by using lexical semantic markers as an information constraint together with a large collocation dictionary.</Paragraph>
<Paragraph position="1"> The lexical semantic marker is applied to terminal nodes for the relational representation, while the collocation information is applied to non-terminal nodes.</Paragraph>
<Paragraph position="2"> The large collocation dictionary has been collected from two resources: the EDR dictionary and Web documents.</Paragraph> </Section>
<Section position="8" start_page="253" end_page="254" type="metho"> <SectionTitle> 7 Test and Evaluation </SectionTitle>
<Paragraph position="0"> The semi-automated decision tree of our domain recognizer uses as features twenty to sixty keywords, which are representative words extracted from twenty-five domains. To raise the accuracy of the domain identifier, manually chosen words have also been added as features.</Paragraph>
<Paragraph position="1"> For training the domain identifier, one thousand sentences from each of the twenty-five domains are used. We tested 250 sentences, ten extracted from each of the twenty-five domains. These test sentences were not part of the training set. The domain identifier outputs the two top-ranked domains as its result. The first-ranked domain alone is correct for 113 sentences (45%); when the second-ranked domain is also considered, the accuracy rises to 75%.</Paragraph>
<Paragraph position="2"> In FromTo/EK, the analysis dictionary consists of about 70,000 English words, 15,000 English compound units, 80,000 English-Korean bilingual words, and 50,000 bilingual collocations. The domain dictionary has 5,000 computer-science words extracted from IEEE reports.</Paragraph>
<Paragraph position="3"> In order to make the evaluation as objective as possible, we compared FromTo/EK with MATES/EK on 1,708 sentences from the September 1991 issue of IEEE Computer magazine, on which MATES/EK had been tested in 1994.</Paragraph> </Section>
<Section position="9" start_page="254" end_page="254" type="metho"> <SectionTitle> Table 1: Evaluation criteria </SectionTitle>
<Paragraph position="0"> 3 (Good): The meaning of the sentence is almost clear.
2 (OK): The meaning of the sentence can be understood after several readings.
1 (Poor): The meaning of the sentence can be guessed only after a lot of readings.
0 (Fail): The meaning of the sentence cannot be guessed at all.</Paragraph>
<Paragraph position="1"> With these evaluation criteria, three randomly selected master's degree students compared and evaluated the translation results of MATES/EK and FromTo/EK on the 1,708 sentences. We considered degrees 4, 3, and 2 in Table 1 as successful translation results.</Paragraph>
<Paragraph position="2"> Figure 3 shows the evaluation result: the translation quality of both FromTo/EK and MATES/EK according to sentence length. More than 84% of the sentences that FromTo/EK translated are understood by human beings.</Paragraph> </Section> </Paper>