<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1109"> <Title>Study of Some Distance Measures for Language and Encoding Identification</Title> <Section position="10" start_page="68" end_page="70" type="evalu"> <SectionTitle> 8 Evaluation </SectionTitle> <Paragraph position="0"> Evaluation was performed for all the measures listed earlier. They are repeated in table-3 with short codes for easy reference.</Paragraph> <Paragraph position="1"> We tested on six different test sizes in terms of characters, namely 100, 200, 500, 1000, 2000, and all the available test data (which was not equal across language-encoding pairs). The number of language-encoding pairs was 53, and the minimum number of test data sets was 840 when we used all the available test data. In the other cases, the number was naturally larger, as the test files were split into fragments (see table-2).</Paragraph> <Paragraph position="2"> The languages considered ranged from Esperanto and Modern Greek to Hindi and Telugu. For Indian languages, especially Hindi, several encodings were tested. Some of the pairs had UTF-8 as the encoding, but the information from the UTF-8 byte format was not explicitly used for identification. The number of languages tested was 39 and the number of encodings was 19. The total number of language-encoding pairs was 53 (see table-1).</Paragraph> <Paragraph position="3"> The test and training data for about half of the pairs was collected from Web pages (from sources such as Gutenberg). For Indian languages, most (but not all) of the data came from what is known as the CIIL corpus. We did not test on various training data sizes.</Paragraph> <Paragraph position="4"> The size of the training data ranged from 2495 to 102377 words, with more data sets on the lower side than on the higher.</Paragraph> <Paragraph position="5"> Note that we have considered the case where both the language and the encoding are unknown, not where one of them is known. In the latter case, the performance can only improve. 
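As a rough illustration of the identification task evaluated here (not the paper's exact formulation), the sketch below scores a test fragment against character n-gram models for each language-encoding pair and picks the closest one. The function names, the trigram order, and the use of plain cross entropy as the distance are all illustrative assumptions; the paper compares several different measures.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def to_dist(counts):
    """Turn raw n-gram counts into a probability distribution."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cross_entropy(p, q, floor=1e-9):
    """H(p, q): how badly model q predicts the test distribution p.
    Unseen n-grams get a tiny floor probability instead of proper smoothing."""
    return -sum(px * math.log(q.get(g, floor)) for g, px in p.items())

def identify(test_text, models, n=3):
    """Return the language-encoding pair whose model is closest to the test text."""
    p = to_dist(char_ngrams(test_text, n))
    return min(models, key=lambda pair: cross_entropy(p, models[pair]))
```

Here `models` would map each (language, encoding) pair to a distribution trained on that pair's data; with 53 pairs the identification is a simple nearest-model search.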
Another point worth mentioning is that the training data was not very clean, i.e., it contained noise (such as words or sentences from other languages). Error details are given in table-4.</Paragraph> <Paragraph position="6"> These errors were for MCE, both with and without word models, for all the test data sizes from 200 characters up to all the available data. Most of the errors occurred at the smaller sizes, i.e., 100 and 200 characters.</Paragraph> </Section> <Section position="11" start_page="70" end_page="70" type="evalu"> <SectionTitle> 9 Results </SectionTitle> <Paragraph position="0"> The results are presented in table-3. As can be seen, almost all the measures gave at least moderately good results. The best results on the whole were obtained with mutual cross entropy. The JC measure gave almost equally good results. Even a simple measure like log probability difference gave surprisingly good results.</Paragraph> <Paragraph position="1"> It can also be observed from table-3 that the size of the test data is an important factor in performance. In general, more test data gives better results, but surprisingly this does not always hold.</Paragraph> <Paragraph position="2"> This means that some other factors also come into play.</Paragraph> <Paragraph position="3"> One of these factors seems to be whether the training data for the different models is of equal size or not. Another factor seems to be noise in the data.</Paragraph> <Paragraph position="4"> Noise seems to affect some measures more than others. For example, LPD gave the worst performance when all the available test data was used. 
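The paper's exact formulation of LPD is not reproduced here; under one plausible reading, it sums absolute differences of log probabilities over the union of the two distributions' n-grams. The sketch below is hypothetical and only meant to show why such a measure is sensitive to noise: every stray n-gram in either distribution contributes a large floored-log term.

```python
import math

def log_prob_difference(p, q, floor=1e-9):
    """Sum of absolute log-probability differences over the union of n-grams.
    A hypothetical reading of LPD, not the paper's definition. Unseen
    n-grams get a small floor probability, so noise-only n-grams each
    add a large |log p - log q| term that is never averaged away."""
    grams = set(p) | set(q)
    return sum(abs(math.log(p.get(g, floor)) - math.log(q.get(g, floor)))
               for g in grams)
```

Under this reading, the more (noisier) data a distribution covers, the more stray n-grams enter the union, which would be consistent with LPD degrading on the full test data.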
For smaller data sets, noise is likely to remain isolated in a few of the data sets and is therefore less likely to affect the results.</Paragraph> <Paragraph position="5"> Using word n-grams to augment character n-grams improved the performance in most cases, but for measures like JC, RE, MRE, and MCE there was not much scope for improvement.</Paragraph> <Paragraph position="6"> In fact, for the smaller sizes (100 and 200 characters), word models actually reduced the performance of these better measures. This means either that word models add little to the better measures, or that we have not used them in the best possible way, even though intuitively they should offer scope for improvement wherever character-based models do not perform perfectly.</Paragraph> </Section> <Section position="12" start_page="70" end_page="70" type="evalu"> <SectionTitle> 10 Issues and Enhancements </SectionTitle> <Paragraph position="0"> Although the method works very well even on little test and training data, there are still some issues and possible enhancements. One major issue is that Web pages quite often contain text in more than one language-encoding. An ideal language-encoding identification tool should be able to mark which parts of a page are in which language-encoding. Another possible enhancement is that, in the case of Web pages, we can also take into account the language and encoding specified in the Web page (HTML). Although this information may not be correct for non-standard encodings, it might still be useful for differentiating between very close encodings like ASCII and ISO-8859-1, which might seem identical to our tool.</Paragraph> <Paragraph position="1"> If the text happens to be in Unicode, then it might be possible to identify at least the encoding (the same encoding might be used for more than one language, e.g., Devanagari for Hindi, Sanskrit, and Marathi) without using a statistical method. 
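A minimal sketch of such non-statistical, Unicode-based identification: count which code-point block covers most of the text. The ranges shown follow the Unicode block definitions; the function name and the small set of blocks are illustrative assumptions.

```python
from collections import Counter

# A few Unicode block ranges relevant to the languages discussed above.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),   # used for Hindi, Sanskrit, Marathi
    "Telugu": (0x0C00, 0x0C7F),
    "Greek": (0x0370, 0x03FF),
    "Latin": (0x0041, 0x007A),        # basic letters only, for illustration
}

def dominant_script(text):
    """Return the script block covering the most characters, or None.
    This narrows down the encoding but not the language: Devanagari
    alone cannot separate Hindi from Sanskrit or Marathi."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None
```

Because the script check is deterministic, its output can serve as a cross-check on the statistical identifier rather than a replacement for it.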
This might be used to validate the result from the statistical method.</Paragraph> <Paragraph position="2"> Since every method, even the best one, has some limitations, for practical applications we will have to combine several approaches in such a way that as much of the available information as possible is used and the various approaches complement each other. What is missed by one approach should be taken care of by another. There will be some issues in combining the various approaches, such as the order in which they have to be used, their respective priorities, and their interaction (ensuring that one does not nullify the gains from another).</Paragraph> <Paragraph position="3"> It will be interesting to apply the same method, or variations of it, to text categorization, topic identification, and other related problems. The distance measures can also be tried on other problems.</Paragraph> </Section> </Paper>