<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1607"> <Title>Building a training corpus for word sense disambiguation in English-to-Vietnamese Machine Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Collecting English-Vietnamese </SectionTitle> <Paragraph position="0"> bilingual texts When chosing this bilingual approach, we have met many difficulties. Firstly, due to no official English-Vietnamese bilingual corpus available up to now, we have had to build them by ourselves by collecting English-Vietnamese bilingual texts from selected sources. Secondly, as most of these sources are not electronic forms, we must convert them into electronic form. During the process of electronic conversion, we have met another drawback.</Paragraph> <Paragraph position="1"> That is: there is no effective OCR (Optical Character Recognition) software available for Vietnamese characters. Compared with English OCR softwares, Vietnamese OCR one is lower just because Vietnamese characters have tone marks (acute, breve, question, tilde, dot below) and diacritics (hook, caret,..). So, we must manually input most of Vietnamese texts (lowquality hardcopies). Only OCR of high-quality hardcopies has been used and manually revised.</Paragraph> <Paragraph position="2"> During collecting English-Vietnamese bilingual texts (figure 1), we choose only following materials: - Science or techniques materials.</Paragraph> <Paragraph position="3"> - Conventional examples in dictionaries.</Paragraph> <Paragraph position="4"> - Bilingual texts that their translations are exact (translated by human translator and published by reputable publishers) and not too diversified (no &quot;one-to-one&quot; translation).</Paragraph> <Paragraph position="5"> So far, we have collected a 5,000,000-word corpus containing 400,000 sentences (most of them are texts in science and conventional fields).</Paragraph> <Paragraph position="6"> (1) SUSANNE (Surface and Underlying Structural ANalyses of Naturalistic English) is constructed by Geoffrey Sampson (1995) at Sussex University, UK. Vietnamese translation is performed by English teacher of VNU-HCMC.</Paragraph> <Paragraph position="7"> (2) Vietnamese &quot;word&quot; is a special linguistic unit in Vietnamese language only, which is often called &quot;tieang&quot;. This lexical unit is lower than traditional words but higher than traditional morphemes. 
<Paragraph position="8"> However, after the collection, we must convert the texts into a unified form (normalization) by aligning sentences as follows.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Sentence-alignment of bilingual corpus </SectionTitle>
<Paragraph position="0"> While inputting this bilingual corpus, we aligned sentences manually in the following format:
*D02:01323: The announcement of the royal birth was broadcast to the nation.</Paragraph>
<Paragraph position="1"> +D02:01323: Lời loan báo sự ra đời của đứa con hoàng tộc đã được truyền thanh trên toàn quốc.</Paragraph>
<Paragraph position="2"> *D02:01324: Announcements of births, marriages and deaths appear in some newspapers.</Paragraph>
<Paragraph position="3"> +D02:01324: Những thông báo về sự ra đời, cưới hỏi, tang chế xuất hiện trên một vài tờ báo.</Paragraph>
<Paragraph position="4"> Here, the leading characters are reference numbers indicating the source and the position of the sentence in the text.</Paragraph>
<Paragraph position="5"> Because most of our bilingual corpus was typed manually, we have not used automatic sentence alignment. Automatic sentence alignment (Gale and Church, 1991) will be necessary once bilingual texts are already available online.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Spelling Checker of bilingual corpus </SectionTitle>
<Paragraph position="0"> After aligning sentences, we automatically check the spelling of the English and Vietnamese words. Here we met another difficulty, Vietnamese word segmentation, because Vietnamese words (similar to Chinese words) are not delimited by spaces (Dien Dinh, 2001). However, our spelling checker can only detect non-existent English or Vietnamese words, so we must still review this corpus manually. In fact, the Vietnamese &quot;word&quot; here is only the &quot;tiếng&quot;, which is equivalent to a Vietnamese &quot;spelling word&quot; or &quot;morpheme&quot; (due to the isolating typology of the language).</Paragraph> </Section> </Section>
<Section position="4" start_page="0" end_page="40" type="metho"> <SectionTitle> 4 Annotating bilingual corpus </SectionTitle>
<Paragraph position="0"> The main task in this paper is to annotate semantic labels. To carry out this task, we take advantage of the classification of semantic classes in LLOCE. We consider these class names as semantic tags and assign them to English words in the source sentences. In this section, we concentrate on annotating semantic tags via class-based word alignment in the English-Vietnamese bilingual corpus.</Paragraph>
<Paragraph position="1"> There are many approaches to word alignment in bilingual corpora, such as statistics-based (Brown, 1993), pattern-based mapping (Melamed, 2000), class-based (Ker and Chang, 1997), etc. Because our main focus is semantic tagging, we have chosen the class-based approach to word alignment. This approach was first suggested by Sue J. Ker and Jason S. Chang (1997) for word alignment of an English-Chinese bilingual corpus.</Paragraph>
<Paragraph position="2"> However, instead of using LDOCE (Longman Dictionary Of Contemporary English) for English and CILIN for Chinese, we use LLOCE enhanced by the Synsets of WordNet for both English and Vietnamese. Thanks to this enhanced LLOCE (40,000 entries), our class dictionary enjoys better coverage than the original LLOCE (only 16,000 entries).</Paragraph>
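To make the class-dictionary idea concrete before the detailed construction in Section 4.2, here is a minimal sketch, not the authors' implementation: the ClassDictionary class, its method names, and the sample enrichment words are hypothetical, and in practice the extra synonyms would come from WordNet Synsets.

```python
# Minimal sketch of a class dictionary: LLOCE set codes mapped to word sets,
# plus enrichment of a class with extra synonyms (e.g. lemmas of a WordNet Synset).
# All names and the sample words are illustrative, not the authors' actual data.
from collections import defaultdict

class ClassDictionary:
    def __init__(self):
        # set code (e.g. "A53") -> set of words belonging to that semantic class
        self.classes = defaultdict(set)

    def add(self, code, words):
        self.classes[code].update(words)

    def codes_of(self, word):
        """Return every set code whose class contains the given word."""
        return {code for code, words in self.classes.items() if word in words}

    def enrich(self, code, synonyms):
        """Extend a class with additional synonyms, enlarging its coverage."""
        self.classes[code].update(synonyms)

# Seed from an LLOCE-style entry, then enrich the class with extra synonyms.
cedic = ClassDictionary()
cedic.add("A53", ["cat", "leopard", "lion", "tiger"])   # <SET: A53> The cat and similar animals
cedic.enrich("A53", ["panther", "cheetah"])             # hypothetical extra synonyms
print(cedic.codes_of("tiger"))                          # {'A53'}
```

The subsections below build this kind of structure for both languages: CEDic for English and CVDic for Vietnamese.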
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Classes in LLOCE </SectionTitle>
<Paragraph position="0"> According to a report of EAGLES (1998), LLOCE is a small-size learner's dictionary largely derived from LDOCE and organized along semantic principles. A quantitative profile of the information provided is given in table 2 below.</Paragraph>
<Paragraph position="1"> The semantic classification in LLOCE is articulated in 3 tiers of increasingly specific concepts represented as major, group and set codes, e.g. <MAJOR: A> Life and living things <GROUP: A50-61> Animals/Mammals <SET: A53> The cat and similar animals: cat, leopard, lion, tiger, ...</Paragraph>
<Paragraph position="2"> Each entry is associated with a set code, e.g. <SET: A53> nouns: The cat and similar animals. Relations of semantic similarity between codes that are not expressed hierarchically are cross-referenced. There are 14 major codes, 127 group codes and 2441 set codes. The list of major codes below provides a general idea of the semantic areas covered:
1. <A> Life and living things
2. <B> The body, its functions and welfare
3. <C> People and the family
4. <D> Buildings, houses, the home, clothes, belongings, and personal care
5. <E> Food, drink, and farming
6. <F> Feelings, emotions, attitudes, and sensations
7. <G> Thought and communication, language and grammar
8. <H> Substances, materials, objects, and equipment
9. <I> Arts and crafts, sciences and technology, industry and education
10. <J> Numbers, measurement, money, and commerce
11. <K> Entertainment, sports, and games
12. <L> Space and time
13. <M> Movement, location, travel, and transport
14. <N> General and abstract terms.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 4.2 Class-based word-alignment </SectionTitle>
<Paragraph position="0"> Clearly, computers cannot understand a human dictionary; they can only process a machine-readable dictionary (MRD). This leads to limited vocabulary coverage as well as semantic ambiguity when we align words relying on a dictionary alone, so class-based alignment is a solution that supplements the notion of in-context translations.</Paragraph>
<Paragraph position="1"> In order to get a good result with a class-based algorithm, the words of both English and Vietnamese have to be classified according to their senses (Resnik, 1999), and the two classifications should be as identical as possible. We have therefore organized the words into classes corresponding to those of LLOCE, and the Vietnamese word classes are named after the corresponding English ones. These seed lexicons must have large coverage, so after building them we use some more reliable thesauri to enrich them.</Paragraph>
<Paragraph position="2"> We now describe the construction of the Vietnamese word-class lexicon, which for convenience we call &quot;CVDic&quot;. Words in this lexicon are classified into many groups, and each group has a unique name called a class code. Given a class code, we can easily determine how many words belong to that word class and even what these words are.</Paragraph>
<Paragraph position="3"> In Step 1, the translations of each English word in LLOCE are inserted in turn into the corresponding class of CVDic.</Paragraph>
<Paragraph position="4"> Let ew be an English word belonging to an English word class of LLOCE, and let VC be the corresponding word class of CVDic.</Paragraph>
<Paragraph position="5"> When looking ew up in LLOCE, we obtain its synonymous translations vw1, vw2, vw3, ...</Paragraph>
<Paragraph position="6"> Then vw1, vw2, vw3, ... are added to CVDic under class VC: VC = {vw1, vw2, vw3, ...}.</Paragraph>
<Paragraph position="7"> As a result, each word class of CVDic includes at least one translation word. Normally, the number of Vietnamese synonyms is very large, because richness of translation is one of the characteristics of Vietnamese.</Paragraph>
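As a small illustration of Step 1 above, here is a hedged sketch, not the authors' code: the lloce mapping, its sample entries, and the Vietnamese translations shown are hypothetical stand-ins for real LLOCE data.

```python
# Sketch of Step 1: seed CVDic from LLOCE-style entries.
# `lloce` is a hypothetical mapping: English word -> (class code, Vietnamese translations).
from collections import defaultdict

lloce = {
    "cat":   ("A53", ["mèo"]),
    "tiger": ("A53", ["cọp", "hổ"]),
}

cvdic = defaultdict(set)                     # class code VC -> set of Vietnamese words
for ew, (code, translations) in lloce.items():
    cvdic[code].update(translations)         # add vw1, vw2, ... to class VC

print(sorted(cvdic["A53"]))                  # ['cọp', 'hổ', 'mèo']
```

Step 2, described next, then enriches these seed classes with whole synonym groups.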
<Paragraph position="8"> In Step 2, we increase the coverage of CVDic by using the English-Vietnamese lexicon VEDic, in which the senses of each entry are organised into synonym groups. For each Vietnamese word on the right-hand side, we check whether it appears in some word class of CVDic; if so, we add the whole synonym group of VEDic to that class of CVDic.</Paragraph>
<Paragraph position="9"> That is, we consider VG as a Vietnamese synonym group of VEDic: if VG shares at least one word with a class VC of CVDic, then VC is extended to VC ∪ VG. For the English word-class lexicon, CEDic, we rely on WordNet. WordNet (Miller, 1996) is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets, and we take advantage of this valuable resource to add more words to the word classes of CEDic.</Paragraph>
<Paragraph position="10"> In WordNet, English words are grouped into Synsets (Synset1, Synset2, ...); this classification model is much more detailed than the one in LLOCE. Therefore, if two of these Synsets each contain a word belonging to the same word class C of CEDic, we add the words of the intersection of these two Synsets to that word class. That means: if Synseti and Synsetj both intersect C, then C is extended to C ∪ (Synseti ∩ Synsetj).</Paragraph> </Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Word alignment algorithm </SectionTitle>
<Paragraph position="0"> Before describing this algorithm, we introduce the following conventions: S stands for an English sentence and T for the corresponding Vietnamese sentence, so that (S, T) is a pair of mutually translated sentences; s is a word in S, and t is the word in T which translates s in this context. DTs is the set of dictionary meanings of the entry s, each meaning being represented by d.</Paragraph>
<Paragraph position="1"> From T we also extract the set of possible Vietnamese words present in T, namely every word sequence of T that appears as an entry of VD, where VD is the Vietnamese dictionary containing the possible Vietnamese words and phrases.</Paragraph>
<Paragraph position="2"> The problem is how computers can recognise which of these candidates is the actual translation t of s. By checking the candidates against the dictionary meanings of s, we can avoid the wrong identification of words in Vietnamese sentences that occurs when word segmentation relies on VD alone. Our algorithm proceeds in the following steps.</Paragraph>
<Paragraph position="3"> We mainly calculate the similarity between a dictionary meaning d and a candidate word t at the morpheme level, using the Dice coefficient (Dice, 1945): Dice(d, t) = 2 |d ∩ t| / (|d| + |t|), where |d| and |t| are the numbers of morphemes in d and in t, and |d ∩ t| is the number of morphemes shared by d and t.</Paragraph>
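A minimal sketch of this morpheme-level Dice similarity follows, assuming Vietnamese morphemes can be approximated by whitespace-separated syllables (tiếng); it is an illustration, not the authors' implementation, and the example strings are hypothetical.

```python
# Morpheme-level Dice similarity between a dictionary meaning d and a candidate word t.
# Vietnamese morphemes are approximated by whitespace-separated syllables (tiếng);
# multiset intersection avoids over-counting repeated syllables.
from collections import Counter

def dice(d: str, t: str) -> float:
    d_morphs = Counter(d.lower().split())
    t_morphs = Counter(t.lower().split())
    if not d_morphs or not t_morphs:
        return 0.0
    common = sum((d_morphs & t_morphs).values())          # |d ∩ t|
    return 2.0 * common / (sum(d_morphs.values()) + sum(t_morphs.values()))

# Hypothetical example: a dictionary gloss vs. a candidate word in the target sentence.
print(dice("loan báo", "lời loan báo"))                   # 2*2 / (2+3) = 0.8
```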
<Paragraph position="4"> Next, for each word pair (s, t) obtained from the sentence pair (S, T), DTSim(s, t) is computed from the Dice similarity between t and the dictionary meanings d in DTs. Then, we choose the candidate translation pairs with the greatest likelihood of connection.</Paragraph> </Section>
<Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3.2 Calculating the correlation between two classes of two languages </SectionTitle>
<Paragraph position="0"> The correlation ratio of a class X and a class Y can be measured using the Dice coefficient as follows: Dice(X, Y) = 2 |X ∩ Y| / (|X| + |Y|), where |X| is the total number of the words in X, |Y| is the total number of the words in Y, and |X ∩ Y| is the number of connections between words of X and words of Y obtained by running the above dictionary-based word alignment over the bilingual corpus.</Paragraph>
<Paragraph position="1"> 4.3.3 Estimating the likelihood of candidate translation pairs
The coefficient presented by Brown (1993) for establishing each connection is a probabilistic value Pr(s, t), the translation probability of each pair (s, t) in (S, T), calculated as the product of the dictionary translation probability t(s|t) and the distortion probability of the word positions in the sentences, d(i|j, l, m), i.e. Pr(s, t) = t(s|t) d(i|j, l, m). However, Sue J. Ker and Jason S. Chang did not agree with this completely: in their opinion, it is very difficult to estimate t(s|t) and d(i|j, l, m) exactly for all values of s, t, i and j in this formula. We share their opinion. Instead, we can create functions based on the dictionary, on word concepts and on the positions of words in the sentences to limit the cases to be examined and computed.</Paragraph>
<Paragraph position="2"> The concept similarity of a word pair (s, t) is measured through the correlation of the classes to which s and t belong. Combining it with DTSim(s, t), we obtain four possible values of t(s, t): the value is chosen according to whether the class correlation exceeds a threshold h1 and whether DTSim(s, t) exceeds a threshold h2, where h1 and h2 are thresholds chosen via experimental results. We have to combine the class information with DTSim(s, t) because we partially rely on the dictionary; besides, this lets us handle the case in which several words of a sentence belong to the same class.</Paragraph> </Section>
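Under the reading given above, a hedged sketch of this estimation might look as follows; the four discrete values (0.9/0.6/0.3/0.1) and the default thresholds are placeholders, not figures reported by the authors.

```python
# Hedged sketch of the likelihood estimation: combine class correlation and DTSim
# into one of four discrete t(s, t) values. All numeric values are placeholders.

def class_dice(size_x: int, size_y: int, connections: int) -> float:
    """Dice correlation of two word classes X and Y; `connections` is the number of
    aligned word pairs linking X and Y found by dictionary-based alignment."""
    if size_x == 0 or size_y == 0:
        return 0.0
    return 2.0 * connections / (size_x + size_y)

def t_value(class_corr: float, dtsim: float, h1: float = 0.5, h2: float = 0.5) -> float:
    """Map the two similarity signals onto four discrete translation-likelihood values."""
    if class_corr >= h1 and dtsim >= h2:
        return 0.9
    if class_corr >= h1:
        return 0.6
    if dtsim >= h2:
        return 0.3
    return 0.1

# Example: classes correlate strongly but the surface forms of s and t differ.
print(t_value(class_dice(120, 95, 60), 0.2))   # 0.6
```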
<Section position="5" start_page="2" end_page="40" type="sub_section"> <SectionTitle> 4.4 Result of sense tagging for corpus </SectionTitle>
<Paragraph position="0"> Because we have made use of class-based word alignment as described above, after aligning the words of the bilingual corpus we can determine the semantic class of each word. For example, according to the classification of LLOCE, the word &quot;letter&quot; has 2 meanings, one being &quot;message&quot; (if it belongs to class G155) and one being &quot;alphabet&quot; (if it belongs to class G148). Similarly, the word &quot;bank&quot; has 3 meanings, one being &quot;money&quot; (if it belongs to class J104), one being &quot;river&quot; (if it belongs to class L99) and one being &quot;line&quot; (if it belongs to class J41). After aligning words, each English word receives the semantic tag of its class; if, for instance, &quot;bank&quot; is aligned to a Vietnamese translation belonging to class J104, the meaning of &quot;bank&quot; is &quot;money&quot;.</Paragraph> </Section>
<Section position="6" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 4.5 Evaluation of sense tagging for corpus </SectionTitle>
<Paragraph position="0"> To evaluate the accuracy of the sense tagging of our corpus, we compare our result with SEMCOR (Shari Landes et al., 1999) on the SUSANNE (Geoffrey Sampson, 1995) part only. The comparison was done manually because the semantic tag sets of LLOCE and SEMCOR differ. The result is that 70% of the annotated words are assigned correct sense tags.</Paragraph>
<Paragraph position="1"> 4.6 Applying sense tagged corpus for WSD
After annotating the bilingual corpus (mainly the English texts), we will apply the TBL method of Eric Brill (1993) to extract disambiguation rules based on the POS, syntactic and semantic information around the polysemous (ambiguous) words.</Paragraph>
<Paragraph position="2"> Firstly, we proceed with the initial tagging of all words (except stopwords) using &quot;naive&quot; labels (the most probable label of each word). Secondly, the learner will generate rules that match the templates describing the format of the rules.</Paragraph>
<Paragraph position="3"> All possible rules that match the templates and replace wrong tags with correct ones are generated by the learner. To decide whether a tag is correct or not, we rely on the training corpus (the annotated corpus from section 4). The TBL method uses rules of the following form: if we call the semantic labels (the LLOCE classes) X, Y, ..., the templates read &quot;Change X into Y if the condition Z is met&quot;. The condition Z may be a word form, a Part-Of-Speech (POS), a syntactic label, or a semantic label. Thus, we must assign each English word an appropriate POS tag with an available POS tagger (such as the POS tagger of Eric Brill) and a syntactic label with an available parser (such as APP, PCPATR, ...). After annotating the morphological, syntactic and semantic labels, we will apply the above templates, in which the condition Z has one of the following formats:
* The i-th word to the left/right of the ambiguous word is a certain &quot;word form W&quot; or a certain symbol.</Paragraph>
<Paragraph position="4"> * The i-th word to the left/right of the ambiguous word has a certain POS tag k (lexical tag).</Paragraph>
<Paragraph position="5"> * The i-th word to the left/right of the ambiguous word has a certain syntactic function (e.g. Subject or Object) with respect to the ambiguous word (syntactic tag).</Paragraph>
<Paragraph position="6"> * The i-th word to the left/right of the ambiguous word has a certain semantic label L.</Paragraph>
<Paragraph position="7"> After using the above templates to extract transformation rules through the training stages, we must revise the rules manually. We will consider the true and reasonable transformation rules as disambiguation rules which can be applied in the WSD module of the English-to-Vietnamese MT system.</Paragraph> </Section> </Section> </Paper>