File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/w06-1109_abstr.xml
Size: 1,498 bytes
Last Modified: 2025-10-06 13:45:16
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1109">
  <Title>Study of Some Distance Measures for Language and Encoding Identification</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>Abstract</SectionTitle>
    <Paragraph position="0">To determine how close two language models (e.g., n-gram models) are, we can use several distance measures. If the models can be represented as distributions, then the similarity between the models is essentially the similarity between the distributions, and a number of such measures are based on an information-theoretic approach. In this paper we present some experiments on using such similarity measures for an old Natural Language Processing (NLP) problem. One of the measures considered is perhaps a novel one, which we have called mutual cross entropy. The other measures are either well known or based on well-known measures, but the results obtained with them vis-a-vis one another may help in gaining insight into how similarity measures work in practice.</Paragraph>
    <Paragraph position="1">The first step in processing a text is to identify the language and encoding of its contents. This is a practical problem, since for many languages there are no universally followed text encoding standards. The method we have used in this paper for language and encoding identification uses pruned character n-grams, alone as well as augmented with word n-grams. This method seems to give results comparable to other methods.</Paragraph>
  </Section>
</Paper>
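The abstract only names the measures, so the following Python sketch is a hypothetical illustration rather than the paper's exact procedure: it assumes "mutual cross entropy" is the symmetric sum of the two cross entropies between pruned character n-gram distributions, and that identification picks the training model (language-encoding pair) with the smallest such distance. The function names, the top-k pruning, and the smoothing constant are all assumptions introduced here for illustration.

    # Illustrative sketch only: assumed formulation of mutual cross entropy
    # over pruned character n-gram distributions, not the paper's exact method.
    from collections import Counter
    import math

    def char_ngram_dist(text, n=3, top_k=1000):
        """Pruned character n-gram distribution: keep the top_k most frequent
        n-grams (assumed pruning strategy) and normalise to probabilities."""
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        pruned = dict(counts.most_common(top_k))
        total = sum(pruned.values())
        return {g: c / total for g, c in pruned.items()}

    def cross_entropy(p, q, eps=1e-9):
        """H(p, q) = -sum_x p(x) log q(x); eps smooths n-grams unseen in q."""
        return -sum(px * math.log(q.get(x, eps)) for x, px in p.items())

    def mutual_cross_entropy(p, q):
        """Assumed symmetric variant: H(p, q) + H(q, p); smaller means more similar."""
        return cross_entropy(p, q) + cross_entropy(q, p)

    def identify(text, models):
        """Return the name of the training model closest to the test text,
        where models maps a language-encoding label to its n-gram distribution."""
        test = char_ngram_dist(text)
        return min(models, key=lambda name: mutual_cross_entropy(test, models[name]))

In practice one distribution would be trained per language-encoding pair from raw text, and the word n-gram augmentation mentioned in the abstract would add a second, analogous distance term; that extension is omitted here.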