<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1043"> <Title>Mixed Language Query Disambiguation</Title> <Section position="6" start_page="337" end_page="338" type="concl"> <SectionTitle> 5 Conclusion and Discussion </SectionTitle> <Paragraph position="0"> Mixed-language queries occur very often in both spoken and written form, especially in Asia.</Paragraph> <Paragraph position="1"> Such queries are usually complete sentences rather than concatenated word strings, because sentences are closer to spoken language and more natural for the user. A mixed-language sentence consists of words mostly in a primary language and some in a secondary language. However, even though mixed-language queries are in sentence form, they are difficult to parse and tag because the secondary-language words introduce ambiguity. Understanding a query can mean finding the matching document, in the case of Web search, or finding the corresponding semantic classes, in the case of an interactive system. In order to understand a mixed-language query, we need to translate the secondary-language words into the primary language unambiguously. In this paper, we present an approach to mixed-language query disambiguation that uses co-occurrence information obtained from a monolingual corpus. Two new types of disambiguation features are introduced, namely voting contextual words and the 1-best contextual word. These two features are compared to the baseline feature of a single neighboring word.</Paragraph> <Paragraph position="2"> Assuming the primary language is English and the secondary language Chinese, our experiments on English-Chinese mixed-language queries show that the average translation accuracy is 75.50% for the baseline, 81.37% for the voting model, and 83.72% for the 1-best model.</Paragraph> <Paragraph position="3"> The baseline method uses only the neighboring word to disambiguate C. The assumption is that the neighboring word is the most semantically relevant. 
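The three features compared here can be illustrated with a small sketch. This is a hypothetical toy implementation with invented co-occurrence counts, not the paper's system or data: the baseline looks only at the nearest neighboring word, the voting model lets every contextual word vote for a candidate translation, and the 1-best model lets the single most discriminative contextual word decide alone.

```python
# Toy co-occurrence counts from a (hypothetical) monolingual corpus:
# COOC[(candidate_translation, context_word)] = count.
COOC = {
    ("interest", "bank"): 2, ("interest", "rate"): 9, ("interest", "hobby"): 1,
    ("interesting", "bank"): 1, ("interesting", "rate"): 1, ("interesting", "hobby"): 6,
}

def score(cand, word):
    return COOC.get((cand, word), 0)

def baseline(cands, context):
    """Baseline feature: use only the nearest neighboring word."""
    return max(cands, key=lambda c: score(c, context[-1]))

def voting(cands, context):
    """Voting feature: each contextual word votes for the candidate
    it co-occurs with most; the candidate with most votes wins."""
    votes = {c: 0 for c in cands}
    for w in context:
        votes[max(cands, key=lambda c: score(c, w))] += 1
    return max(votes, key=votes.get)

def one_best(cands, context):
    """1-best feature: let the single most discriminative contextual
    word (largest gap between its top two candidate scores) decide."""
    def gap(w):
        s = sorted((score(c, w) for c in cands), reverse=True)
        return s[0] - s[1]
    best_w = max(context, key=gap)
    return max(cands, key=lambda c: score(c, best_w))

cands = ["interest", "interesting"]   # candidate translations of C
context = ["rate", "bank", "hobby"]   # contextual words, nearest last
print(baseline(cands, context), voting(cands, context), one_best(cands, context))
# prints "interesting interest interest"
```

In this toy example the nearest neighbor ("hobby") misleads the baseline, while the voting and 1-best features recover the intended sense from long-distance context, mirroring the accuracy ordering reported above.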
This method leaves out an important feature of natural language: long-distance dependency. Experimental results show that it is not sufficient to use only the nearest neighboring word for disambiguation.</Paragraph> <Paragraph position="4"> The voting method performs better than the baseline because more contextual words are used. The results are consistent with the ideas in (Gale and Church, 1994; Schütze, 1992; Yarowsky, 1995).</Paragraph> <Paragraph position="5"> In our experiments, we found that the 1-best contextual word is even better than multiple contextual words. This seemingly counter-intuitive result leads us to believe that choosing the single most discriminative word is more powerful than weighting multiple contextual words equally. We believe this is consistent with the idea of &quot;trigger pairs&quot; in (Rosenfeld, 1995) and of Singular Value Decomposition in (Schütze, 1992).</Paragraph> <Paragraph position="6"> We can conclude that long-distance contextual words are sometimes more discriminative than immediate neighboring words, and that multiple contextual words can contribute to better disambiguation. Our results support our belief that natural sentence-based queries are less ambiguous than keyword-based queries.</Paragraph> <Paragraph position="7"> Our method, which uses multiple disambiguating contextual words, can take advantage of syntactic information even when parsing or tagging is not possible, as in the case of mixed-language queries.</Paragraph> <Paragraph position="8"> Other advantages of our approach include: (1) the training is unsupervised and no domain-dependent data is necessary; (2) neither bilingual nor mixed-language corpora are needed for training; and (3) it can generate monolingual queries in both the primary and secondary languages, enabling true cross-language IR.</Paragraph> <Paragraph position="9"> In our future work, we plan to analyze the various &quot;discriminating words&quot; contained in a mixed 
language or monolingual query to find out which classes of words contribute most to the final disambiguation. We also want to test the significance, for the disambiguation task, of the co-occurrence information among the contextual words themselves. Finally, we plan to develop a general mixed-language and cross-language understanding framework for both document retrieval and interactive tasks.</Paragraph> </Section> </Paper>