File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/00/a00-2011_concl.xml
Size: 2,080 bytes
Last Modified: 2025-10-06 13:52:38
<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2011">
<Title>Word-for-Word Glossing with Contextually Similar Words</Title>
<Section position="12" start_page="83" end_page="84" type="concl">
<SectionTitle> 6 Available at babelfish.altavista.com/cgi-bin/translate </SectionTitle>
<Paragraph position="0"> The algorithm presented in this paper can be improved and extended in many ways. At present, our glossing algorithm does not take the prior probabilities of translations into account.</Paragraph>
<Paragraph position="1"> For example, in WSJ, the bank account sense of account is much more common than the report sense. We should thus tend to prefer this sense of account. This is achievable by weighting the translation scores by the prior probabilities of the translations. We are investigating an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) to learn these prior probabilities. Initially, we assume that the candidate translations for a word are uniformly distributed. After glossing each word in a large corpus, we refine the prior probabilities using the frequency counts obtained. This process is repeated several times until the empirical prior probabilities closely approximate the true prior probabilities.</Paragraph>
<Paragraph position="2"> Finally, as discussed in Section 2.3, automatically constructing the bilingual thesaurus is necessary to gloss whole documents. This is attainable by adding a corpus-based destination language thesaurus to our system. The process of assigning a cluster of similar words as a WAT to a candidate translation c is as follows. First, we automatically obtain the candidate translations for a word using a bilingual dictionary. With the destination language thesaurus, we obtain a list S of all words similar to c. With the bilingual dictionary, replace each word in S by its source language translations.
Using the group similarity metric from Section 5, assign as the WAT the cluster of similar words (obtained from the source language thesaurus) most similar to S.</Paragraph>
</Section>
</Paper>
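<!--
The prior re-estimation loop outlined in the second paragraph can be sketched
as follows. This is a minimal illustration under stated assumptions, not the
authors' implementation: it uses hard assignments (each occurrence counts only
toward its best-scoring translation), which simplifies full EM. The function
name, the add-k smoothing, the context scores, and the French candidates are
all hypothetical.

```python
from collections import Counter


def reestimate_priors(occurrences, candidates, iterations=5, smoothing=1.0):
    """Iteratively re-estimate prior probabilities of candidate translations.

    occurrences: one dict per corpus occurrence of the source word, mapping
                 each candidate translation to a context score (hypothetical).
    candidates:  the candidate translations of the source word.
    """
    # Start from a uniform prior over the candidates, as the text describes.
    priors = {c: 1.0 / len(candidates) for c in candidates}
    for _ in range(iterations):
        counts = Counter()
        for scores in occurrences:
            # Gloss this occurrence: weight each context score by the prior.
            best = max(candidates, key=lambda c: priors[c] * scores.get(c, 0.0))
            counts[best] += 1
        # Refine the priors from the frequency counts. The add-k smoothing is
        # an assumption made here so that no candidate drops to probability 0.
        total = len(occurrences) + smoothing * len(candidates)
        priors = {c: (counts[c] + smoothing) / total for c in candidates}
    return priors


# Hypothetical context scores for five occurrences of "account".
occurrences = [
    {"compte": 0.9, "rapport": 0.2},
    {"compte": 0.8, "rapport": 0.3},
    {"compte": 0.7, "rapport": 0.1},
    {"compte": 0.5, "rapport": 0.6},
    {"compte": 0.2, "rapport": 0.9},
]
priors = reestimate_priors(occurrences, ["compte", "rapport"])
print(priors)
```

With these made-up scores, the ambiguous fourth occurrence is glossed as
"rapport" under the uniform prior and switches to "compte" in the second pass,
once the re-estimated prior starts to favour "compte".
-->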