<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1109">
  <Title>Study of Some Distance Measures for Language and Encoding Identification</Title>
  <Section position="5" start_page="64" end_page="64" type="metho">
    <SectionTitle>
3 Pruned Character N-grams
</SectionTitle>
    <Paragraph position="0"> Like in Cavnar's method, we used pruned n-grams models of the reference or training as well as test data. For each language-encoding pair, some training data is provided. A character based n-gram model is prepared from this data. N-grams of all orders are combined and ranked according to frequency. A certain number of them (say 1000) with highest frequencies are retained and the rest are dropped. This gives us the pruned character n-grams model, which is used for language-encoding identi cation.</Paragraph>
    <Paragraph position="1"> As an attempt to increase the performance, we also tried to augment the pruned character n-grams model with a word n-gram model.</Paragraph>
  </Section>
  <Section position="6" start_page="64" end_page="65" type="metho">
    <SectionTitle>
4 Distance Measures
</SectionTitle>
    <Paragraph position="0"> Some of the measures we have experimented with have already been mentioned in the section on previous work. The measures considered in this work range from something as simple as log probability difference to the one based on Jiang and Conrath (Jiang and Conrath, 1997) measure.</Paragraph>
    <Paragraph position="1"> Assuming that we have two models or distributions P and Q over a variable X, the measures (sim) are de ned as below (p and q being probabilities and r and s being ranks in models P and Q:  3. Cross entropy: sim =summationdisplay x (p(x) [?] log q(x)) (3) 4. RE measure (based on relative entropy or Kullback-Leibler distance see note below): sim =summationdisplay x p(x) log p(x)log q(x) (4) 5. JC measure (based on Jiang and Conrath's measure) (Jiang and Conrath, 1997):</Paragraph>
    <Paragraph position="3"> 6. Out of rank measure (Cavnar and Trenkle, 1994): sim =summationdisplay</Paragraph>
    <Paragraph position="5"> 7. MRE measure (based on mutual or symmetric relative entropy, the original de nition of KL-distance given by Kullback and Leibler):</Paragraph>
    <Paragraph position="7"> As can be noticed, all these measures, in a way, seem to be information theoretic in nature. However, our focus in this work is more on the presenting empirical evidence rather than discussing mathematical foundation of these measures. The latter will of course be interesting to look into.</Paragraph>
    <Paragraph position="8"> NOTE: We had initiallly experimented with relative entropy or KL-distance as de ned below (instead of the RE measure mentioned above):</Paragraph>
    <Paragraph position="10"> Another measure we tried was DL measure (based on Dekang Lin's measure, on which the JC measure is based):</Paragraph>
    <Paragraph position="12"> where A and B are as given above.</Paragraph>
    <Paragraph position="13"> The results for the latter measure were not very good (below 50% in all cases) and the RE measure de ned above performed better than relative entropy. These results have not been reported in this paper.</Paragraph>
  </Section>
  <Section position="7" start_page="65" end_page="65" type="metho">
    <SectionTitle>
5 Mutual Cross Entropy
</SectionTitle>
    <Paragraph position="0"> Cross entropy is a well known distance measure used for various problems. Mutual cross entropy can be seen as bidirectional or symmetric cross entropy. It is de ned simply as the sum of the cross entropies of two distributions with each other.</Paragraph>
    <Paragraph position="1"> Our motivation for using 'mutual' cross entropy was that many similarity measures like cross entropy and relative entropy measure how similar one distribution is to the other. This will not necessary mean the same thing as measuring how similar two distributions are to each other. Mutual information measures this bidirectional similarity, but it needs joint probabilities, which means that it can only be applied to measure similarity of terms within one distribution. Relative entropy or Kullback-Leibler measure is applicable, but as the results show, it doesn't work as well as expected.</Paragraph>
    <Paragraph position="2"> Note that some authors treat relative entropy and mutual information interchangeably. They are very similar in nature except that one is applicable for one variable in two distributions and the other for two variables in one distribution.</Paragraph>
    <Paragraph position="3"> Our guess was that symmetric measures may give better results as both the models give some information about each other. This seems to be supported by the results for cross entropy, but (asymmetric) cross entropy and RE measures also gave good results.</Paragraph>
  </Section>
  <Section position="8" start_page="65" end_page="67" type="metho">
    <SectionTitle>
6 The Algorithm
</SectionTitle>
    <Paragraph position="0"> The foundation of the algorithm for identifying the language and encoding of a text or string has already been explained earlier. Here we give a summary of the algorithm we have used. The parameters for the algorithm and their values used in our experiments reported here have also been listed.</Paragraph>
    <Paragraph position="1"> These parameters allow the algorithm to be tuned  All test data 840 or customized for best performance. Perhaps they can even be learned by using some approach as the  EM algorithm.</Paragraph>
    <Paragraph position="2"> 1. Train the system by preparing character based and word based (optional) n-grams from the training data.</Paragraph>
    <Paragraph position="3"> 2. Combine n-grams of all orders (Oc for characters and Ow for words).</Paragraph>
    <Paragraph position="4"> 3. Sort them by rank.</Paragraph>
    <Paragraph position="5"> 4. Prune by selecting only the top Nc character n-grams and Nw word n-grams for each language-encoding pair.</Paragraph>
    <Paragraph position="6"> 5. For the given test data or string, calculate the character n-gram based score simc with every model for which the system has been trained.</Paragraph>
    <Paragraph position="7"> 6. Select the t most likely language-encoding  pairs (training models) based on this character based n-gram score.</Paragraph>
    <Paragraph position="8"> 7. For each of the t best training models, calculate the score with the test model. The score is calculated as: score = simc + a [?] simw (13) where c and w represent character based and word based n-grams, respectively. And a is the weight given to the word based n-grams. In our experiment, this weight was 1 for the case when word n-grams were considered and 0 when they were not.</Paragraph>
    <Paragraph position="9"> 8. Select the most likely language-encoding pair out of the t ambiguous pairs, based on the combined score obtained from word and character based models.</Paragraph>
    <Paragraph position="10">  CN: Character n-grams only, CWN: Character n-grams plus word n-grams To summarize, the parameters in the above method are: 1. Character based n-gram models Pc and Qc 2. Word based n-gram models Pw and Qw 3. Orders Oc and Ow of n-grams models 4. Number of retained top n-grams Nc and Nw (pruning ranks for character based and word based n-grams, respectively) 5. Number t of character based models to be disambiguated by word based models 6. Weight a of word based models  Parameters 3 to 6 can be used to tune the performace of the identi cation system. The results reported in this paper used the following values of  these parameters: 1. Oc = 4 2. Ow = 3 3. Nc = 1000 4. Nw = 500 5. t = 5 6. a = 1  There is, of course, the type of similarity score, which can also be used to tune the performance. Since MCE gave the best overall performance in our experiments, we have selected it as the default score type.</Paragraph>
  </Section>
  <Section position="9" start_page="67" end_page="68" type="metho">
    <SectionTitle>
7 Implementation
</SectionTitle>
    <Paragraph position="0"> The language and encoding tool has been implemented as a small API in Java. This API uses another API to prepare pruned character and word n-grams which was developed as part of another project. A graphical user interface (GUI) has also been implemented for identifying the languages and encodings of texts, les, or batches of les.</Paragraph>
    <Paragraph position="1"> The GUI also allows a user to easily train the tool for a new language-encoding pair. The tool will be modi ed to work in client-server mode for documents from the Internet.</Paragraph>
    <Paragraph position="2"> From implementation point of view, there are some issues which can signi cantly affect the performance of the system:  1. Whether the data should be read as text or as a binary le.</Paragraph>
    <Paragraph position="3"> 2. The assumed encoding used for reading the text, both for training and testing. For example, if we read UTF8 data as ISO-8859-1, there will be errors.</Paragraph>
    <Paragraph position="4"> 3. Whether the tranining models should be read every time they are needed or be kept in memory.</Paragraph>
    <Paragraph position="5"> 4. If training models are stored (even if they are only read at the beginning and then kept in  memory), as will have to be done for practical applications, how should they be stored: as text or in binary les?  To take care of these issues, we adopted the following policy: 1. For preparing character based models, we read the data as binary les and the characters are read as bytes and stored as numbers. For word based models, the data is read as text and the encoding is assumed to be UTF8. This can cause errors, but it seems to be the best (easy) option as we don't know the actual encoding. A slightly more dif cult option to implement would be to use character based models to guess the encoding and then build word based models using that as the assumed encoding. The problem with this method will be that no programming environment supports all possible encodings. Note that since we are reading the text as bytes rather than characters for preparing 'character based n-grams', technically we should say that we are using byte based n-grams models, but since we have not tested on multi-byte encodings, a byte in our experiments was almost always a character, except when the encoding was UTF8 and the byte represented some meta-data like the script code. So, for practical purposes, we can say that we are using character based n-grams.</Paragraph>
    <Paragraph position="6"> 2. Since after pruning, the size of the models (character as well as word) is of the order of 50K, we can afford to keep the training models in memory rather than reading them every time we have to identify the language and encoding of some data. This option is naturally faster. However, for some applications where language and encoding identi cation is to be done rarely or where there is a memory constraint, the other option can be used.</Paragraph>
    <Paragraph position="7">  3. It seems to be better to store the training models in binary format since we don't know the  actual encoding and the assumed encoding for storing may be wrong. We tried both options and the results were worse when we stored the models as text.</Paragraph>
    <Paragraph position="8"> Our identi cation tool provides customizability with respect to all the parameters mentioned in this and the previous section.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML