<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1109"> <Title>Study of Some Distance Measures for Language and Encoding Identification</Title> <Section position="4" start_page="63" end_page="64" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> Language identification was one of the first natural language processing (NLP) problems for which a statistical approach was used.</Paragraph> <Paragraph position="1"> Ingle (Ingle, 1976) used a list of short words in various languages and matched the words in the test data against this list. Such methods, based on lists of words or letters (unique strings), were meant for human translators and could not be used directly for automatic language identification. They ignored the text encoding, since they assumed printed text.</Paragraph> <Paragraph position="2"> Even if adapted for automatic identification, they were not very effective or scalable.</Paragraph> <Paragraph position="3"> However, the earliest approaches used for automatic language identification were based on the above idea and could be called 'translator approaches'. Newman (Newman, 1987), among others, used lists of letters, especially accented letters, for various languages, and identification was done by matching the letters in the test data to these lists.</Paragraph> <Paragraph position="4"> Beesley's (Beesley, 1988) automatic language identifier for online texts was based on mathematical language models developed for breaking ciphers. These models basically had characteristic letter sequences and frequencies ('orthographical features') for each language, making them similar to n-gram models. The insights on which they are based, as Beesley points out, have been known at least since the time of Ibn ad-Duraihim, who lived in the 14th century. Beesley's method needed 6-64 K of training data and 10-12 words of test data. It treated each language and encoding pair as one entity.</Paragraph> <Paragraph position="5"> Adams and Resnik (Adams and Resnik, 1997) describe a client-server system using Dunning's n-gram based algorithm (Dunning, 1994) that offers NLP applications a variety of tradeoffs, for example between labelling accuracy and the size and completeness of language models. Their system dynamically adds language models and uses other tools to identify the text encoding. They use 5-grams with add-k smoothing. Training data size was 1-50 K and the test size was above 50 characters.</Paragraph> <Paragraph position="6"> Some pruning is done, e.g., of n-grams with frequencies up to 3.</Paragraph> <Paragraph position="7"> Some methods for language identification use techniques similar to n-gram based text categorization (Cavnar and Trenkle, 1994), which calculates and compares profiles of n-gram frequencies. This is the approach nearest to ours. Such methods differ in the way they calculate the likelihood that the test data matches one of the profiles.</Paragraph> <Paragraph position="8"> Beesley's method simply uses word-wise probabilities of 'digram' sequences, multiplying the probabilities of the sequences in the test string. Others use some distance measure between the training and test profiles to find the best match.</Paragraph> <Paragraph position="9"> Cavnar also mentions that the top 300 or so n-grams are almost always highly correlated with the language, while the lower ranked n-grams give a more specific indication about the text, namely its topic. The distance measure used by Cavnar is called the 'out-of-rank' measure; it sums up the differences in rankings of the n-grams found in the test data as compared to the training data. This is among the measures we have tested.</Paragraph>
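<Paragraph> As an illustration (ours, not code from any of the cited papers), the following minimal Python sketch computes such a rank-based distance between character n-gram profiles. The function names, the profile size of 300 and the penalty assigned to n-grams absent from the training profile are assumptions, not details taken from Cavnar and Trenkle.

# Illustrative sketch of a rank-order ('out-of-rank' style) distance
# between character n-gram profiles; names and parameters are assumed.
from collections import Counter

def ngram_profile(text, n=3, top=300):
    # Rank character n-grams by frequency, most frequent first.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_rank_distance(train_profile, test_profile):
    # Sum the rank differences; n-grams unseen in training get a maximum penalty.
    train_rank = {gram: r for r, gram in enumerate(train_profile)}
    max_penalty = len(train_profile)
    distance = 0
    for test_rank, gram in enumerate(test_profile):
        if gram in train_rank:
            distance += abs(train_rank[gram] - test_rank)
        else:
            distance += max_penalty
    return distance

# Identification: choose the language whose training profile is closest, e.g.
# predicted = min(train_profiles,
#                 key=lambda lang: out_of_rank_distance(train_profiles[lang],
#                                                       ngram_profile(test_text)))
</Paragraph>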
<Paragraph position="10"> The language model used by Combrinck and Botha (Combrinck and Botha, 1994) is also based on bigram or trigram frequencies (they call them 'transition vectors'). They select the most distinctive transition vectors by using, as a measure, the ratio of the maximum percentage of occurrences to the total percentage of occurrences of a transition vector. These distinctive vectors then form the model.</Paragraph> <Paragraph position="11"> Dunning (Dunning, 1994) also used an n-gram based method in which the selected model is the one most likely to have generated the test string. Giguet (Giguet, 1995b; Giguet, 1995a) relied upon grammatically correct words instead of the most common words. He also used knowledge about the alphabet and about word morphology via syllabification. Giguet tried this method for tagging the sentences in a document with the language name, i.e., for dealing with multilingual documents.</Paragraph> <Paragraph position="12"> Another method (Stephen, 1993) was based on 'common words' which are characteristic of each language. This method assumes unique words for each language. One major problem with it was that the test string might not contain any of the unique words.</Paragraph> <Paragraph position="13"> Cavnar's method, combined with some heuristics, was used by Kikui (Kikui, 1996) to identify languages as well as encodings for multilingual text. He relied on known mappings between languages and encodings and treated East Asian languages differently from West European languages.</Paragraph> <Paragraph position="14"> Kranig (Simon, 2005) and Muthusamy et al. (Muthusamy et al., 1994) have reviewed and evaluated some of the well known language identification methods. Martins and Silva (Martins and Silva, 2005) describe a method similar to Cavnar's, but one which uses a different similarity measure, proposed by Jiang and Conrath (Jiang and Conrath, 1997). Some heuristics are also employed.</Paragraph> <Paragraph position="15"> Poutsma's (Poutsma, 2001) method is based on Monte Carlo sampling of n-grams from the beginning of the document instead of building a complete model of the whole document. Sibun and Reynar (Sibun and Reynar, 1996) use mutual information statistics, or relative entropy, also called the Kullback-Leibler distance, for language identification. Souter et al. (Souter et al., 1994) compared unique character string, common word and 'trigraph' based approaches and found the last to be the best.</Paragraph> <Paragraph position="16"> Compression based approaches have also been used for language identification. One example is Prediction by Partial Matching (PPM), proposed for this task by Teahan (Teahan and Harper, 2001). This approach uses the cross entropy of the test data under a language model that predicts each character given its context.</Paragraph> </Section> </Paper>