<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1061"> <Title>Satoshi Sekine ++</Title> <Section position="7" start_page="0" end_page="93" type="evalu"> <SectionTitle> 4 Experiments and Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Conditions </SectionTitle> <Paragraph position="0"> In our experiments, we used 744,204 short words and 618,538 long words for training, and 63,037 short words and 51,796 long words for testing.</Paragraph> <Paragraph position="1"> These words were extracted from the one tenth of the CSJ that had already been manually tagged. The training corpus consisted of 319 speeches and the test corpus consisted of 19 speeches.</Paragraph> <Paragraph position="2"> Each transcription consisted of a basic form and a pronunciation, as shown in Fig. 1. Speech sounds were faithfully transcribed as pronunciation and were also represented as basic forms by using kanji and hiragana characters. Lines beginning with numerical digits are time stamps; each represents the time it took to produce the lines between that time stamp and the next one. Each line other than a time stamp represents a bunsetsu. In our experiments, we used only the basic forms. Basic forms were tagged with several types of labels, such as fillers, as shown in Table 1; in our experiments, these labels were handled according to the rules shown in the rightmost column of Table 1.</Paragraph> <Paragraph position="3"> Since there are no boundaries between sentences in the corpus, we selected the places in the CSJ that were automatically detected as pauses of 500 ms or longer and designated them as sentence boundaries. In addition, we also used utterance boundaries as sentence boundaries. These were automatically detected at places where short pauses (shorter than 200 ms but longer than 50 ms) follow the typical sentence-ending forms of predicates such as verbs, adjectives, and copulas.</Paragraph> <Paragraph position="6"> In the CSJ, bunsetsu boundaries, which are phrase boundaries in Japanese, were manually detected. Fillers and disfluencies were marked with the labels (F) and (D). In the experiments, we eliminated fillers and disfluencies but used their positional information as features. We also used as features the bunsetsu boundaries and the labels (M), (O), (R), and (A), which were assigned to particular morphemes such as personal names and foreign words. Thus, the input sentences for training and testing were character strings without fillers and disfluencies, with boundary information and the various labels attached to them. Given a sentence, for every string within a bunsetsu and every string appearing in a dictionary, the probabilities of a in Eq. (1) were estimated by using the morpheme model. The output was a sequence of morphemes with grammatical attributes, as shown in Fig. 2. We used the POS categories in the CSJ as grammatical attributes. We obtained 14 major POS categories for short words and 15 major POS categories for long words. Therefore, a in Eq. (1) can be one of 15 tags from 0 to 14 for short words, and one of 16 tags from 0 to 15 for long words. The features we used with morpheme models in our experiments are listed in Table 2. Each feature consists of a type and a value, which are given in the rows of the table, and it corresponds to j in the function g_{i,j}(a, b) in Eq. (1).
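To make Eq. (1) concrete, the following is a minimal sketch of how a maximum entropy model of this form turns such features into tag probabilities. The feature names, weights, and context encoding below are illustrative assumptions, not the model's actual parameters.

import math

# A minimal sketch of a maximum entropy morpheme model in the style of
# Eq. (1): p(a|b) is proportional to exp(sum of lambda_{i,j} * g_{i,j}(a, b)),
# where a is a POS tag (0-14 for short words) and b is the context (the
# candidate string and its surroundings). All feature names and weights
# here are hypothetical illustrations.

def g(feature, a, b):
    """Binary feature function g_{i,j}(a, b): fires when the context b
    contains the feature's value and the candidate tag is a."""
    return 1.0 if feature[1] == a and feature[0] in b["features"] else 0.0

def tag_probabilities(b, weights, tags):
    """Return p(a|b) for every candidate tag a, normalized over all tags."""
    scores = {}
    for a in tags:
        s = sum(lam * g(feat, a, b) for feat, lam in weights.items())
        scores[a] = math.exp(s)
    z = sum(scores.values())  # partition function Z(b)
    return {a: v / z for a, v in scores.items()}

# Hypothetical context for a candidate string inside a bunsetsu: the string
# itself, the left morpheme's POS, and a boundary indicator (cf. Table 2).
b = {"features": {"String(0)=te", "POS(-1)=Verb", "Boundary(Beginning)"}}
weights = {("POS(-1)=Verb", "PPP"): 1.2, ("Boundary(Beginning)", "Noun"): 0.8}
print(tag_probabilities(b, weights, tags=["Noun", "Verb", "PPP"]))

In this sketch the tag with the largest weighted feature sum receives the highest probability, which is what lets low-probability outputs be flagged for manual examination later in this section.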
The notations "(0)" and "(-1)" used in the feature-type column of Table 2 respectively indicate the target string and the morpheme to its left. The terms used in the table are basically the same as those that Uchimoto et al. used (Uchimoto et al., 2002). The main difference is the following: Boundary: bunsetsu boundaries and the positional information of labels such as fillers. "(Beginning)" and "(End)" in Table 2 respectively indicate whether the left and right sides of the target string are boundaries.</Paragraph> <Paragraph position="7"> We used only those features that were found three or more times in the training corpus.</Paragraph> <Paragraph position="8"> We used the following information as features on the target word: the word and its POS category, and the same information for the four closest words, the two on the left and the two on the right of the target word. Bigram and trigram words that included the target word, plus bigram and trigram POS categories that included the target word's POS category, were used as features. In addition, bunsetsu boundaries as described in Section 4.1.1 were used. For example, when a target word was "t" in Fig. 3, "fix&Noun&PPP", "PPP&Verb&PPP", and "Bunsetsu(Beginning)" were used as features.</Paragraph> </Section> <Section position="2" start_page="0" end_page="93" type="sub_section"> <SectionTitle> 4.2 Results and Discussion </SectionTitle> <Paragraph position="0"> Results of the morphological analysis obtained by using morpheme models are shown in Tables 3 and 4. In these tables, OOV indicates the out-of-vocabulary rate. In Table 3, OOV was calculated as the proportion of words not found in a dictionary to all words in the test corpus. In Table 4, OOV was calculated as the proportion of word and POS category pairs not found in a dictionary to all pairs in the test corpus. Recall is the percentage of morphemes in the test corpus for which the segmentation and major POS category were identified correctly.</Paragraph> <Paragraph position="1"> Precision is the percentage of all morphemes identified by the system that were identified correctly. The F-measure is defined by the following equation: F-measure = 2 × Recall × Precision / (Recall + Precision).</Paragraph> <Paragraph position="2"> Tables 3 and 4 show that accuracies would improve significantly if no words were unknown. This indicates that all morphemes of the CSJ could be analyzed accurately if there were no unknown words.</Paragraph> <Paragraph position="3"> The improvements that we can expect by detecting unknown words and putting them into dictionaries are about 1.5 points in F-measure for detecting word segments of short words and about 2.5 points for long words. For detecting word segments together with their POS categories, we expect an improvement of about 2 points in F-measure for short words and about 3 points for long words.</Paragraph> <Paragraph position="4"> Next, we discuss the accuracies obtained when unknown words existed. The OOV for long words was 4% higher than that for short words. In general, the higher the OOV is, the more difficult it is to detect word segments and their POS categories. However, the difference between the accuracies for short and long words was about 1% in recall and 2% in precision, which is not significant when we consider that the difference between the OOVs for short and long words was 4%. This result indicates that our morpheme models could detect both known and unknown words accurately, especially long words.
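As a worked illustration of these metrics, here is a minimal sketch that computes recall, precision, and F-measure from raw counts. The gold-morpheme count is taken from the test-set size given in Section 4.1; the system-output and correct counts are hypothetical, not figures from Tables 3 and 4.

def evaluate(n_gold, n_system, n_correct):
    """Recall, precision, and F-measure for segmentation plus major POS.
    n_gold: morphemes in the test corpus; n_system: morphemes output by
    the system; n_correct: morphemes whose segment and POS both match."""
    recall = n_correct / n_gold
    precision = n_correct / n_system
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# 63,037 short words in the test corpus (Section 4.1); the other two
# counts are made-up examples.
print(evaluate(n_gold=63037, n_system=63500, n_correct=61000))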
Therefore, we investigated the recall of unknown words in the test corpus and found that 55.7% (928/1,667) of short word segments and 74.1% (2,660/3,590) of long word segments were detected correctly. In addition, regarding unknown words, we found that 47.5% (791/1,667) of short word segments plus their POS categories and 67.3% (2,415/3,590) of long word segments plus their POS categories were detected correctly. The recall of unknown words was thus about 20% higher for long words than for short words. We believe this result mainly reflects the difference between the definitions of short and long words with respect to compound words. A compound is defined as a single word under the definition of long words, but as two or more words under the definition of short words. Furthermore, under the definition of short words, how a compound is divided depends on its context. Therefore, more information is needed to detect short words precisely than is needed for long words. Next, we extracted the words that were detected by the morpheme model but were not found in a dictionary, and investigated the percentage of unknown words that completely or partially matched the extracted words in their contexts. This percentage was 77.6% (1,293/1,667) for short words and 80.6% (2,892/3,590) for long words. Most of the remaining unknown words that could not be detected by this method are compound words. We expect that these compounds can be detected during the manual examination of the words for which the morpheme model estimated a low probability, as will be shown later.</Paragraph> <Paragraph position="5"> The recall of unknown words was lower than that of known words, and the accuracy of automatic morphological analysis was lower than that of manual morphological analysis. As previously stated, to improve the accuracy of the whole corpus we take a semi-automatic approach. We assume that the smaller the probability a model estimates for an output morpheme, the more likely that morpheme is to be wrong, and we examine output morphemes in ascending order of their probabilities. We investigated how much the accuracy of the whole corpus would increase under this scheme. Fig. 5 shows the relationship between the percentage of output morphemes whose probabilities exceed a threshold and their precision. In this figure, "short without UKW", "long without UKW", "short with UKW", and "long with UKW" represent the precision for short words detected assuming there were no unknown words, the precision for long words detected assuming there were no unknown words, the precision of short words including unknown words, and the precision of long words including unknown words, respectively.</Paragraph> <Paragraph position="6"> As the output rate on the horizontal axis increases, the number of low-probability morphemes increases. In all graphs, precision decreases monotonically as the output rate increases. This means that tagging errors can be revised effectively when morphemes are examined in ascending order of their probabilities.</Paragraph> <Paragraph position="7"> Next, we investigated the relationship between the percentage of morphemes examined manually and the precision obtained after the detected errors were revised. The result is shown in Fig. 6. Precision here means the precision of word segmentation and POS tagging.
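The examination procedure above can be summarized in a short sketch: sort the model's output by estimated probability and examine the lowest-probability fraction first. The morpheme tuples and probabilities below are hypothetical.

def examination_queue(morphemes, examine_rate):
    """morphemes: list of (surface, pos, probability) output by the model.
    Returns the fraction to examine manually, lowest probability first."""
    ranked = sorted(morphemes, key=lambda m: m[2])  # ascending probability
    n = int(len(ranked) * examine_rate)
    return ranked[:n]

# Hypothetical model output; examining 25% surfaces the most doubtful tag.
output = [("hanashi", "Verb", 0.98), ("wa", "PPP", 0.99),
          ("kyou", "Noun", 0.42), ("masu", "AuxVerb", 0.91)]
print(examination_queue(output, examine_rate=0.25))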
If unknown words were detected and put into a dictionary by the method described in the fourth paragraph of this section, the graph line for short words would be drawn between the lines "short without UKW" and "short with UKW", and the graph line for long words would be drawn between the lines "long without UKW" and "long with UKW". Based on the test results, we can expect better than 99% precision for short words and better than 97% precision for long words in the whole corpus when we examine 10% of the output morphemes in ascending order of their probabilities. [Fig. 6 caption: Precision obtained after revising detected errors (when morphemes with probabilities under the threshold and their adjacent morphemes are examined), plotted against the percentage of examined morphemes.]</Paragraph> <Paragraph position="10"> Finally, we investigated the relationship between the percentage of morphemes examined manually and the error rate among the examined morphemes.</Paragraph> <Paragraph position="11"> The result is shown in Fig. 7. We found that about 50% of the examined morphemes would be found to be errors at the beginning of the examination, and that about 20% would be found to be errors by the time examination of 10% of the whole corpus was completed. When unknown words were detected and put into a dictionary, the error rate decreased; even so, over 10% of the examined morphemes would be found to be errors.</Paragraph> <Paragraph position="12"> Results of the morphological analysis of long words obtained by using a chunking model are shown in Tables 5 and 6. The first and second lines show the respective accuracies obtained when the OOVs were 5.81% and 6.93%. The third lines show the accuracies obtained when we assumed that the OOV for short words was 0% and there were no errors in detecting short word segments and their POS categories. The fourth line in Table 6 shows the accuracy obtained when a chunking model without transformation rules was used.</Paragraph> <Paragraph position="13"> The accuracy obtained by using the chunking model was one point higher in F-measure than that obtained by using the morpheme model, and it was very close to the accuracy achieved for short words. This result indicates that the errors newly produced by applying a chunking model to the results obtained for short words were few, or that errors in the results obtained for short words were amended by applying the chunking model. It also shows that we can achieve good accuracy for long words by applying a chunking model even if we do not detect unknown long words and put them into a dictionary. If we could improve the accuracy for short words, the accuracy for long words would improve as well. The third lines in Tables 5 and 6 show that the accuracy would improve to over 98 points in F-measure. The fourth line in Table 6 shows that the transformation rules contributed significantly to improving the accuracy.</Paragraph> <Paragraph position="14"> Considering the results obtained in this section and in Section 4.2.1, we are now detecting short and long word segments and their POS categories in the whole corpus by using the following steps: 1. Automatically detect and manually examine unknown words for short words.</Paragraph> <Paragraph position="15"> 2. Improve the accuracy for short words in the whole corpus by manually examining short words in ascending order of their probabilities estimated by a morpheme model.</Paragraph> <Paragraph position="16"> 3. Apply a chunking model to the short words to detect long word segments and their POS categories (a sketch of this chunking step is given below).
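As a rough sketch of step 3, the following illustrates chunking short words into long words with begin/inside labels. The label set, the omission of transformation rules, and the example words are simplified assumptions, not the exact scheme used with our chunking model.

def chunk_long_words(short_words, labels):
    """short_words: list of (surface, short_pos); labels: parallel list of
    'B-<POS>' (begins a long word, carrying its POS) or 'I' (continues
    the current long word). Returns a list of (surface, long_pos)."""
    long_words = []
    for (surface, _), label in zip(short_words, labels):
        if label.startswith("B-"):
            long_words.append([surface, label[2:]])
        else:  # 'I': concatenate the surface onto the current long word
            long_words[-1][0] += surface
    return [tuple(w) for w in long_words]

# A hypothetical compound: two short nouns merged into one long noun.
short = [("gengo", "Noun"), ("shori", "Noun"), ("wo", "PPP")]
labels = ["B-Noun", "I", "B-PPP"]
print(chunk_long_words(short, labels))  # [('gengoshori', 'Noun'), ('wo', 'PPP')]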
As future work, we plan to use an active learning method such as that proposed by Argamon-Engelson and Dagan (1999) to improve the accuracy of the whole corpus more effectively.</Paragraph> </Section> </Section> </Paper>