<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1054"> <Title>JAPANESE WORD SEGMENTATION BY HIDDEN MARKOV MODEL</Title> <Section position="6" start_page="285" end_page="286" type="evalu"> <SectionTitle> 4. EXPERIMENTS AND ANALYSIS </SectionTitle> <Paragraph position="0"> To train and test the hidden Markov model, a corpus of 5,529 Japanese articles annotated by the MAJESTY system was used, since a manually annotated corpus of sufficient size was not available. From these articles, 59,587 sentences (1,882,231 words) were used as training material and 634 different sentences (21,430 words) were set aside as test data.</Paragraph> <Paragraph position="1"> When the trained model was run over the test sentences, it segmented 91.15% of the words correctly while achieving 96.48% accuracy on word boundaries. Correct segmentation of a single word implies that: * both its beginning and ending word boundaries are determined correctly, and * no extra word boundaries are generated within the word.</Paragraph> <Paragraph position="2"> The results over distinct words are given in Table 1 and the results for word boundaries are in Table 2.</Paragraph> <Paragraph position="3"> These performance figures compare favorably with the previously reported results of the BBN Japanese word segmentation and part-of-speech algorithm. That system, described in Section 2 and currently in use in the BBN PLUM data extraction system, achieved 91.7% accuracy in word segmentation in a test. In addition, the word segmentation HMM was designed and implemented in under one person-week, whereas the aforementioned architecture and all its components took significantly longer.</Paragraph> <Paragraph position="4"> The performance figures listed above are telling; with a simple but cleverly constructed model, the system managed to correctly segment words at a respectable rate.
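The correctness criterion stated above (both edges of the word must be found, and no extra boundary may fall inside it) can be made concrete with a small scoring sketch. This is an illustration under assumed conventions, not the authors' evaluation code: sentences are represented as lists of word strings, and boundaries as character offsets.

```python
def boundary_set(words):
    """Character offsets of the internal word boundaries of a sentence."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts

def word_accuracy(gold, hyp):
    """Fraction of gold words segmented correctly: both edges of the word
    are hypothesized boundaries (or sentence edges) and no extra boundary
    falls inside the word."""
    hyp_cuts = boundary_set(hyp)
    edges = hyp_cuts | {0, len("".join(gold))}
    correct, pos = 0, 0
    for w in gold:
        start, end = pos, pos + len(w)
        pos = end
        ok_edges = (start in edges) and (end in edges)
        extra = any(c in hyp_cuts for c in range(start + 1, end))
        if ok_edges and not extra:
            correct += 1
    return correct / len(gold)

def boundary_accuracy(gold, hyp):
    """Fraction of between-character positions classified correctly."""
    g, h = boundary_set(gold), boundary_set(hyp)
    n = len("".join(gold))
    agree = sum(1 for p in range(1, n) if (p in g) == (p in h))
    return agree / (n - 1)
```

Because a single missed or spurious cut spoils the two words adjacent to it, word accuracy is always at most boundary accuracy, matching the paper's 91.15% versus 96.48% figures.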
This performance was achieved entirely without accessing any of the word lexicons that are traditionally employed in solving this problem. Furthermore, no rule bases are referred to; the algorithm simply relies on the structure of the training data to implicitly obtain a model of Japanese word segmentation.</Paragraph> <Paragraph position="5"> While the HMM both misses and imagines word boundaries, it is encouraging that the total numbers of hypothesized words and word boundaries are close to the true numbers. This assures us that the model is generating an appropriate number of boundaries, even though it is not completely accurate on all of them.</Paragraph> <Paragraph position="6"> The fact that the model performs so well has interesting implications for the morphology of the Japanese language. The model relies on the idea that consecutive characters are significant with regard to whether or not they will be separated by a word boundary. This suggests that there is a set of character pairs which rarely occur next to one another within the same word; these are the 2-character boundary sequences used in the HMM and include at least the katakana character set as an edge. Furthermore, there must be another set of character pairs which are frequently found in succession in the same word, corresponding to the model's 2-character continuation sequences.</Paragraph> <Section position="1" start_page="285" end_page="286" type="sub_section"> <SectionTitle> 4.1.
Training Set Size </SectionTitle> <Paragraph position="0"> As with any stochastic model, this HMM relies on an accurate set of probabilities which reflect the true nature of the domain.</Paragraph> <Paragraph position="1"> The limiting factor here, barring any gross problems with the model, is the amount of data on which the model is trained.</Paragraph> <Paragraph position="2"> Clearly, when the training procedure has seen only the first few examples, the HMM is a very poor representation of Japanese word boundaries. As such, a large amount of information is collected in a relatively short period of time in the initial stages of learning. The model becomes more complete as it sees a larger and larger portion of the possible 2-character sequences.</Paragraph> <Paragraph position="3"> It is of interest to determine the point at which the size of the training set no longer has a great impact on the performance of the algorithm, as this reveals whether the model is under-trained or over-trained. To get a sense for this, the model was trained on successively larger training sets, starting with a very small set of 123 words and growing to the full 1,882,231-word set. Using a logarithmic scale for the axis representing training set size gives a feeling for the additional performance accrued from more training, while factoring in the exponential growth of available computing resources.</Paragraph> <Paragraph position="4"> Based on the graph, we can see that while the word segmentation error rate diminishes more slowly as the training set size approaches 1,882,231 words (the final point plotted), the curve still exhibits a downward trend. This implies that additional training could further improve the accuracy of the model.</Paragraph> <Paragraph position="5"> As expected, the largest increase in performance occurs over the initial 30,000 words, where the word segmentation error rate goes from 75% to 25%.
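The experiment just described can be mimicked with a deliberately simplified sketch: a toy bigram-counting segmenter (an illustration of the 2-character-sequence intuition, not the paper's HMM, data, or Viterbi decoding) trained on successively larger prefixes of the annotated material, with sentence-exact error standing in for word error.

```python
from collections import defaultdict

def cuts_of(words):
    """Character offsets of the internal word boundaries of a sentence."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts

def train(sentences):
    """Count, per 2-character sequence, how often a word boundary falls
    between the two characters in the annotated sentences (word lists)."""
    boundary, total = defaultdict(int), defaultdict(int)
    for words in sentences:
        text, cuts = "".join(words), cuts_of(words)
        for i in range(len(text) - 1):
            pair = text[i : i + 2]
            total[pair] += 1
            boundary[pair] += 1 if (i + 1) in cuts else 0
    return boundary, total

def segment(text, boundary, total):
    """Cut wherever the observed bigram was a boundary at least half the
    time; unseen bigrams default to continuation."""
    words, start = [], 0
    for i in range(len(text) - 1):
        pair = text[i : i + 2]
        if total[pair] and 2 * boundary[pair] >= total[pair]:
            words.append(text[start : i + 1])
            start = i + 1
    words.append(text[start:])
    return words

def learning_curve(train_sents, test_sents, sizes):
    """Sentence-exact error rate after training on successively larger
    prefixes of the training material."""
    rates = []
    for size in sizes:
        boundary, total = train(train_sents[:size])
        hits = 0
        for words in test_sents:
            hits += 1 if segment("".join(words), boundary, total) == words else 0
        rates.append(1 - hits / len(test_sents))
    return rates
```

Early prefixes leave most bigrams unseen, so the error rate starts high and falls as coverage of the possible 2-character sequences grows, mirroring the steep initial portion of the curve described above.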
At approximately 150,000 words, the rate of change in the error rate decreases significantly, but the curve still shows a distinct downward trend. Furthermore, the difference between the word segmentation error rate and the word boundary determination error rate is continuously shrinking; with additional training data the gap between the curves is expected to narrow further.</Paragraph> <Paragraph position="6"> To portray the amount of new information that is received over time, Figure 5 shows the number of unique 2-character sequences in each of the successively larger training sets. It is interesting to note that the model continuously sees new 2-character sequences at a steady, though slightly decreasing, rate. By the time the training set numbers 50,000 words, the most common 2-character sequences have been seen, and further training data, while still improving test performance, yields diminishing returns because the sequences that remain unseen are relatively rare.</Paragraph> </Section> </Section> </Paper>