File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/95/p95-1050_concl.xml
Size: 2,391 bytes
Last Modified: 2025-10-06 13:57:27
<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1050"> <Title>Identifying Word Translations in Non-Parallel Texts</Title> <Section position="6" start_page="321" end_page="321" type="concl"> <SectionTitle> 5 Discussion and prospects </SectionTitle> <Paragraph position="0"> It could be shown that even for unrelated English and German texts the patterns of word co-occurrences strongly correlate. The monotonically increasing character of the curves in figure 1 indicates that in principle it should be possible to find word correspondences in two matrices of different languages by randomly permuting one of the matrices until the similarity function s reaches a minimum and thus indicates maximum similarity. However, the minimum-curve in figure 1 suggests that there are some deep minima of the similarity function even in cases when many word correspondences are incorrect. An algorithm currently under construction therefore searches for many local minima, and tries to find out what word correspondences are the most reliable ones. In order to limit the search space, translations that are known beforehand can be used as anchor points.</Paragraph> <Paragraph position="1"> Future work will deal with the following as yet unresolved problems: * Computational limitations require the vocabularies to be limited to subsets of all word types in large corpora. With criteria like the corpus frequency of a word, its specificity for a given domain, and the salience of its co-occurrence patterns, it should be possible to make a selection of corresponding vocabularies in the two languages. 
If morphological tools and disambiguators are available, preliminary lemmatization of the corpora would be desirable.</Paragraph> <Paragraph position="2"> * Ambiguities in word translations can be taken into account by working with continuous probabilities to judge whether a word translation is correct instead of making a binary decision.</Paragraph> <Paragraph position="3"> Thereby, different sizes of the two matrices could be allowed for.</Paragraph> <Paragraph position="4"> It can be expected that with such a method the quality of the results depends on the thematic comparability of the corpora, but not on their degree of parallelism. As a further step, even with non-parallel corpora it should be possible to locate comparable passages of text.</Paragraph> </Section> </Paper>
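<!--
The permutation search outlined in the first paragraph of the discussion can be
illustrated with a minimal sketch. It is not the author's algorithm, which the
text describes only as under construction. Assumptions made here: s is taken to
be the sum of absolute differences between corresponding matrix cells, the
search is a simple hill climb over random pairwise swaps, and reliability is
estimated by counting how often a pairing recurs across restarts. All function
names, parameters, and the toy data are hypothetical.

```python
import random


def dissimilarity(a, b, perm):
    """Sum of |a[i][j] - b[perm[i]][perm[j]]| over all cells.

    Lower values mean the two co-occurrence matrices agree better when
    word i of the first language is mapped to word perm[i] of the second.
    The city-block form is an assumption of this sketch.
    """
    n = len(a)
    return sum(
        abs(a[i][j] - b[perm[i]][perm[j]])
        for i in range(n)
        for j in range(n)
    )


def random_start(n, anchors, rng):
    """Random permutation of range(n) that keeps the anchor pairs fixed."""
    used = set(anchors.values())
    free_rows = [i for i in range(n) if i not in anchors]
    free_cols = [k for k in range(n) if k not in used]
    rng.shuffle(free_cols)
    perm = [None] * n
    for i, k in anchors.items():
        perm[i] = k
    for i, k in zip(free_rows, free_cols):
        perm[i] = k
    return perm


def local_search(a, b, anchors=None, steps=2000, seed=0):
    """Hill climbing by random pairwise swaps; returns (perm, score)."""
    rng = random.Random(seed)
    anchors = anchors or {}
    n = len(a)
    perm = random_start(n, anchors, rng)
    score = dissimilarity(a, b, perm)
    free = [i for i in range(n) if i not in anchors]
    if len(free) < 2:
        return perm, score
    for _ in range(steps):
        i, j = rng.sample(free, 2)
        perm[i], perm[j] = perm[j], perm[i]
        new_score = dissimilarity(a, b, perm)
        if new_score <= score:
            score = new_score
        else:
            perm[i], perm[j] = perm[j], perm[i]  # undo a worsening swap
    return perm, score


def vote_over_restarts(a, b, anchors=None, restarts=20, steps=2000):
    """Count how often each pair (i, perm[i]) recurs across local minima."""
    votes = {}
    for seed in range(restarts):
        perm, _ = local_search(a, b, anchors, steps, seed)
        for i, k in enumerate(perm):
            votes[(i, k)] = votes.get((i, k), 0) + 1
    return votes


# Toy data: b is a with rows and columns reordered by a known mapping.
a = [
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 7],
    [0, 1, 7, 0],
]
true_perm = [2, 0, 3, 1]
n = len(a)
b = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        b[true_perm[i]][true_perm[j]] = a[i][j]

# Word 0 is assumed to have a known translation (an anchor point).
perm, score = local_search(a, b, anchors={0: 2})
votes = vote_over_restarts(a, b, anchors={0: 2}, restarts=5, steps=200)
```

Pairs with high counts in votes recur in many local minima and would be the
candidates for reliable correspondences. A single run may stop in a local
minimum with incorrect pairings, which is the difficulty the discussion points
out for the minimum-curve in figure 1.
-->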