<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2116">
  <Title>Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> in the Thai language, there is no explicit word boundary; this causes a lot of problems in Thai language processing including word segmentation, information retrieval, machine translation, and so on. Unless there is regularity in defining word entries, Thai language processing will never be effectively done. The existing Thai language processing tasks mostly rely on the hand-coded dictionaries to acquire the information about words. These manually created dictionaries have a lot of drawbacks. First, it cannot deal with words that are not registered in the dictionaries.</Paragraph>
    <Paragraph position="1"> Second, because these dictionaries are manually created, they will never cover all words that occur in real corpora. This paper, therefore, proposes an automatic word-extraction algorithm, which hopefully can overcome this Thai language-processing barrier.</Paragraph>
    <Paragraph position="2"> An essential and non-trivial task for the languages that exhibit inexplicit word boundary such as Thai, Japanese, and many other Asian languages undoubtedly is the task in identifying word boundary. &amp;quot;Word&amp;quot;, generally, means a unit of expression which has universal intuitive recognition by native speakers. Linguistically, word can be considered as the most stable unit which has little potential to rearrangement and is uninterrupted as well. &amp;quot;Uninterrupted&amp;quot; here attracts our lexical knowledge bases so much.</Paragraph>
    <Paragraph position="3"> There are a lot of uninterrupted sequences of words functioning as a single constituent of a sentence. These uninterrupted strings, of course are not the lexical entries in a dictionary, but each occurs in a very high frequency. The way to point out whether they are words or not is not distinguishable even by native speakers. Actually, it depends on individual judgement. For example, a Thai may consider 'oonfila~mu' (exercise) a whole word, but another may consider 'n~n~m~' as a compound: 'oon' (take)+ 'filg~' (power)+ 'too' (body). Computationally, it is also difficult to decide where to separate a string into words. Even though it is reported that the accuracy of recent word segmentation using a dictionary and some heuristic methods is in a high level. Currently, lexicographers can make use of large corpora and show the convincing results from the experiments over corpora. We, therefore, introduce here a new efficient method for consistently extracting and identifying a list of acceptable Thai words.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="804" type="metho">
    <SectionTitle>
2 Previous Works
</SectionTitle>
    <Paragraph position="0"> Reviewing the previous works on Thai word extraction, we found only the work of Sornlertlamvanich and Tanaka (1996). They employed the fiequency of the sorted character n-grams to extract Thai open compounds; the strings that experienced a significant change of occurrences when their lengths are extended. This algorithm reports about 90% accuracy of Thai  open compound extraction. However, the algorithm emphasizes on open compotmd extraction and has to limit tile range of n-gram to 4-20 grams for the computational reason. This causes limitation in the size of corpora and efficiency in the extraction.</Paragraph>
    <Paragraph position="1"> The other works can be found in the research on the Japanese language. Nagao et al. (1994) has provided an effective method to construct a sorted file that facilitates the calculation of n-gram data. But their algorithm did not yield satisfactory accuracy; there were many iuwflid substrings extracted. The following work (lkehara et al., 1995) improved the sorted file to avoid repeating in counting strings. The extraction cesult was better, but the determination of the longest strings is always made consecutively from left to right. If an erroneous string is extracted, its errors will propagate through the rest of the input :~trings.</Paragraph>
    <Section position="1" start_page="802" end_page="802" type="sub_section">
      <SectionTitle>
3 Our Approach
3.1 The C4.5 Learning Algorithm
</SectionTitle>
      <Paragraph position="0"> Decision tree induction algorithms have been successfully applied for NLP problems such as sentence boundary dismnbiguation (Pahner et al.</Paragraph>
      <Paragraph position="1"> 1997), parsing (Magerman 1995) and word segmentation (Mekuavin et al. 1997). We employ the c4.5 (Quinhln 1993) decision tree induction program as the learning algorithm for word extraction.</Paragraph>
      <Paragraph position="2"> The induction algorithm proceeds by evaluating content of a series of attributes and iteratively building a tree fiom the attribute values with the leaves of the decision tree being the value of the goal attribute. At each step of learning procedure, the evolving tree is branched on the attribute that pal-titions tile data items with the highest information gain. Branches will be added until all items in the training set arc classified. To reduce the effect of overfitting, c4.5 prunes the entire decision tree constructed. It recursively examines each subtree to determine whether replacing it with a leaf or brauch woukt reduce expected error rate. This pruning makes the decision tree better in dealing with tile data different froul tile training data.</Paragraph>
    </Section>
    <Section position="2" start_page="802" end_page="803" type="sub_section">
      <SectionTitle>
3.2 Attributes
</SectionTitle>
      <Paragraph position="0"> We treat the word extraction problem as the problem of word/nou-word string disambiguation.</Paragraph>
      <Paragraph position="1"> The next step is to identify the attributes that are able to disambiguate word strings flom non-word strings. The attributes used for the learning algorithm are as follows.</Paragraph>
      <Paragraph position="2">  h{fbrmation Mutual information (Church et al. 1991) of random variable a and b is the ratio of probability that a and b co-occur, to tile indepeudent probability that a and b co-occur. High mutual information indicates that a and b co-occur lnore than expected by chance. Our algorithm employs left and right mutual information as attributes in word extraction procedure. Tile left mutual information (Lm), and right mutual information (Rm) of striug ayz are defined as:</Paragraph>
      <Paragraph position="4"> where x is the leftmost character ofayz y is the lniddle substring ol'ayz is the rightmost character of :tlVz p( ) is tile probability function. If xyz is a word, both Lm(xyz) and Rm(~yz) should be high. On the contra W, if .rye is a non-word string but consists of words and characters, either of its left or right mutual information or both lnust be low. For example, 'ml~qn~&amp;quot; ( &amp;quot;n'(a Thai alphabet) 'fl~anq'(The word means appear in Thai.) ) must have low left mutual information.</Paragraph>
      <Paragraph position="5">  Eutropy (Shannon 1948) is the information measuring disorder of wu'iables. The left and right entropy is exploited as another two attributes in our word extraction. Left entropy (Le), and right entropy (Re) of stringy are defined as:</Paragraph>
      <Paragraph position="7"> where y is the considered string, A is the set of all alphabets x, z is any alphabets in A. Ify is a word, the alphabets that come before and aflery should have varieties or high entropy. If y is not a complete word, either of its left or right entropy, or both must be low. For example, 'ahan' is not a word but a substring of word 'O~3n~l' (appear). Thus the choices of the right adjacent alphabets to '~qn' must be few and the right entropy of 'ahw, when the right adjacent alphabet is '~', must be low.</Paragraph>
      <Paragraph position="8">  It is obvious that the iterative occurrences of words must be higher than those of non-word strings. String frequency is also useful information for our task. Because the string frequency depends on the size of corpus, we normalize the count of occurrences by dividing by the size of corpus and multiplying by the average value of Thai word length:</Paragraph>
      <Paragraph position="10"> where s is the considered string N(s) is the number of the occurrences of s in corpus Sc is the size of corpus Avl is the average Thai word length. We employed the frequency value as another attribute for the c4.5 learning algorithm.  Short strings are more likely to happen by chance than long strings. Then, short and long strings should be treated differently in the disambiguation process. Therefore, string length is also used as an attribute for this task.</Paragraph>
      <Paragraph position="11">  Functional words such as '~' (will) and '~' (then) are frequently used in Thai texts. These functional words are used often enough to mislead the occurrences of string patterns. To filter out these noisy patterns from word extraction process,  A very useful process for our disambiguation is to check whether the considered string complies with Thai spelling rules or not. We employ the words in the Thai Royal Institute dictionary as spelling examples for the first and last two characters. Then we define attributes Fc(s)and Lc(s) for this task as follows.</Paragraph>
      <Paragraph position="13"> the dictionary that begin with s~s 2 N(*s,_ls,,) is the nmnber of words in the dictionary that end with s,,_~s,, ND is the number of words in the dictionary.</Paragraph>
    </Section>
    <Section position="3" start_page="803" end_page="804" type="sub_section">
      <SectionTitle>
3.3 Applying C4.5 to Thai Word Extraction
</SectionTitle>
      <Paragraph position="0"> The process of applying c4.5 to our word extraction problem is shown in Figure 1. Firstly, we construct a training set for the c4.5 learning algorithm. We apply Yamamoto et al.(1998)'s algorithm to extract all strings from a plain and unlabelled I-MB corpus which consists of 75 articles from various fields. For practical and reasonable purpose, we select only the 2-to-30character strings that occur more than 2 times,  have positive right and left entropy, and conform to simple Thai spelling rules. To this step, we get about 30,000 strings. These strings are lnalmally tagged as words or non-word strings. The strings' statistics explained above are calculated for each string. Then the strings' attributes and tags are used as the training example for the learning algorithln. The decision tree is then constructed from the training data.</Paragraph>
      <Paragraph position="1"> In order to test the decision tree, another plain I-MB corpus (the test corpus), which consists of 72 articles fi'om various fields, is employed. All strings in the test corpus are extracted and filtered out by the same process as used in the training set. After the filtering process, we get about 30,000 strings to be tested. These 30,000 strings are manually tagged in order that the precision and recall of the decision tree can be evaluated. The experimental results will be discussed in the next section.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML