Inducing a multilingual dictionary from a parallel multitext in related languages

6 Discussion and Future Work

We have built a system for multilingual dictionary induction from parallel corpora that significantly improves quality over the standard existing tool (GIZA) by taking advantage of the fact that the languages are related and that more than two of them are available. Because the system attempts to be completely agnostic about the languages it works on, it can likely be used successfully on many language groups, requiring almost no linguistic knowledge on the part of the user. Only the prefix and suffix components are somewhat language-specific, but even they are sufficiently general to work, with varying degrees of success, on most inflective and agglutinative languages (which together form a large majority of languages). For full generality, we would also need a model of infixes for languages such as Hebrew or Arabic.
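As a rough, hypothetical sketch of what such a language-specific affix component has to do, the Python function below enumerates candidate prefix+stem+suffix analyses of a word given small affix inventories. The function name and the toy affix lists are ours for illustration and are not taken from the paper.

```python
# Minimal sketch of a language-specific affix component (illustrative only;
# the affix inventories and function name are invented, not the paper's).

def candidate_splits(word, prefixes, suffixes):
    """Enumerate (prefix, stem, suffix) analyses of `word`, allowing the
    empty affix on either side and requiring a non-empty stem."""
    analyses = []
    for pre in [""] + sorted(prefixes, key=len, reverse=True):
        if not word.startswith(pre):
            continue
        rest = word[len(pre):]
        for suf in [""] + sorted(suffixes, key=len, reverse=True):
            if not rest.endswith(suf):
                continue
            stem = rest[:len(rest) - len(suf)]
            if stem:
                analyses.append((pre, stem, suf))
    return analyses

# Toy, Russian-like affix inventories (invented for illustration).
PREFIXES = {"za", "na", "po"}
SUFFIXES = {"ami", "am", "a", "u"}

print(candidate_splits("zagorami", PREFIXES, SUFFIXES))
# [('', 'zagorami', ''), ('', 'zagor', 'ami'),
#  ('za', 'gorami', ''), ('za', 'gor', 'ami')]
```

The idea is that stems, unlike affixes, tend to be shared across closely related languages, so analyses of this kind can be matched cross-lingually while everything else in the system stays language-independent.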
We must admit, however, that we have not yet tested our approach on other language families. Our short-term plan is to test the model on several Romance languages, e.g., Spanish, Portuguese, and French.

Looking at the first lines of Table 1, one can see that using more than a pair of languages, even with a model that relies on only a small feature set, can dramatically improve performance (compare the second and third columns), while the system remains able to find the optimal values of all its internal parameters automatically.

As discussed in the introduction, the ultimate goal of this project is to produce tools, such as a parser, for languages that lack them. Several approaches are possible, all involving the use of the dictionary we have built. At that stage we would no longer treat all languages in the same way: for a language in the group that already has such tools, we would use them to further improve the performance of the pairwise models involving that language and, indirectly, even of the pairs not involving it. Using these tools, we may be able to improve the word translation model even further, simply as a side effect.

Once we have built a high-quality dictionary for a specialized domain such as the Bible, it might be possible to expand to a more general setting by mining the Web for potential parallel texts.

Our technique is limited in the coverage of the resulting dictionary, which can only contain words that occur in our corpus. Whatever the corpus may be, however, it will include the most common words of the target language, and these are precisely the words that tend to vary the most between related (and even unrelated) languages. The relatively rare words (e.g., domain-specific and technical terms) can often be translated simply by inferring morphological rules that transform words of one language into words of another. Thus, one may expand the dictionary's coverage using non-parallel texts in both languages, or even in just one language if its morphology is sufficiently regular.
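As a concrete, if deliberately simplified, illustration of the rule-inference idea in the last paragraph, the sketch below extracts suffix-rewrite rules from a few seed translation pairs and applies the most frequent applicable rule to an unseen word. The seed pairs are invented (ASCII-simplified, loosely Czech-to-Slovak) and the helper names are ours, not the paper's.

```python
# Hypothetical sketch of inferring suffix-rewrite rules from a seed
# dictionary between two related languages (toy data, invented names).
from collections import Counter

def suffix_rule(src, tgt):
    """Strip the longest common prefix of a translation pair and return
    the remaining (source_suffix, target_suffix) rewrite rule."""
    i = 0
    while i < min(len(src), len(tgt)) and src[i] == tgt[i]:
        i += 1
    return src[i:], tgt[i:]

def learn_rules(seed_pairs):
    """Count the suffix-rewrite rules observed in a seed dictionary."""
    return Counter(suffix_rule(s, t) for s, t in seed_pairs)

def translate(word, rules):
    """Apply the most frequent rule whose source suffix matches `word`."""
    for (src_suf, tgt_suf), _count in rules.most_common():
        if word.endswith(src_suf):
            return word[: len(word) - len(src_suf)] + tgt_suf
    return word  # no rule applies: fall back to the identity mapping

# Invented, ASCII-simplified seed pairs for two closely related languages.
seed = [("informace", "informacia"),
        ("organizace", "organizacia"),
        ("stanice", "stanica")]

rules = learn_rules(seed)
print(translate("federace", rules))  # -> federacia (learned rule: e -> ia)
```

This is only a sketch: rule induction over real data would need frequency thresholds and character-level alignment rather than a bare longest-common-prefix split.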