<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1140"> <Title>Bringing the Dictionary to the User: the FOKS system</Title> <Section position="6" start_page="9" end_page="11" type="evalu"> <SectionTitle> 4 Analysis and Evaluation </SectionTitle> <Paragraph position="0"> To evaluate the proposed system, we first provide a short analysis of the reading set distribution and then describe the results of a preliminary experiment on real-life data.</Paragraph> <Section position="1" start_page="9" end_page="9" type="sub_section"> <SectionTitle> 4.1 Reading set analysis </SectionTitle> <Paragraph position="0"> Since we create a large number of plausible readings, one potential problem is that a large number of candidates would be returned for each reading, obscuring the dictionary entries for which the input is the correct reading. This could impose a high penalty on competent users who mostly search the dictionary with correct readings, potentially making the system unusable.</Paragraph> <Paragraph position="1"> To investigate this, we established how many candidates are returned for user queries over readings the system has knowledge of, and also tested whether the number of results depends on the length of the query.</Paragraph> <Paragraph position="2"> The distribution of results returned for different queries is given in Figure 2, and the average number of results returned for queries of different lengths is given in Figure 3. In both figures, Baseline is calculated over only those readings in the original dictionary (i.e. correct readings); Existing is the subset of readings in the generated set that existed in the original dictionary; and All is all readings in the generated set. The distribution of the latter two sets is calculated over the generated set of readings. 
In Figure 2 the x-axis represents the number of results returned for a given reading and the y-axis represents the natural log of the number of readings returning that number of results. It can be seen that only a few readings return a high number of entries: 308 out of 2,194,159 readings, or 0.014%, return over 30 results. As it happens, most of the readings returning a high number of results are readings that existed in the original dictionary, as can be seen from the fact that Existing and All are almost identical for x values over 30. Note that the average number of dictionary entries returned per reading is 1.21 for the complete set of generated readings.</Paragraph> <Paragraph position="3"> Moreover, as seen from Figure 3, the number of results depends heavily on the length of the reading. In this figure, the x-axis gives the length of the reading in characters and the y-axis the average number of entries returned. It can be seen that queries of 4 characters or more are likely to return 3 results or fewer on average. Here again, the Existing readings have the highest average, at 2.88 results returned for 4-character queries. The 308 readings mentioned above were on average 2.80 characters in length.</Paragraph> <Paragraph position="4"> From these results, it would appear that the number of entries returned is ordinarily not overwhelming and, provided that the desired entries are included in the list of candidates, the system should prove useful to a learner. 
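The candidate-count analysis above can be reproduced from any index mapping generated readings to the dictionary entries they retrieve. The following sketch computes the share of readings returning an overwhelming number of results (using the paper's threshold of 30) and the average number of results by query length, as in Figures 2 and 3. The toy index is a hypothetical stand-in, not the actual FOKS reading set.

```python
from collections import defaultdict

def reading_stats(candidates_by_reading):
    """Summarise how many dictionary entries each generated reading returns.

    candidates_by_reading: dict mapping a kana reading string to the list
    of dictionary entries retrieved for it.
    """
    counts = {r: len(entries) for r, entries in candidates_by_reading.items()}
    total = len(counts)

    # Share of readings that return more than 30 results (Figure 2 analysis).
    heavy_share = sum(1 for c in counts.values() if c > 30) / total

    # Average number of results broken down by query length (Figure 3 analysis).
    by_len = defaultdict(list)
    for reading, c in counts.items():
        by_len[len(reading)].append(c)
    avg_by_len = {n: sum(cs) / len(cs) for n, cs in by_len.items()}

    # Overall average number of entries per reading.
    avg_overall = sum(counts.values()) / total
    return heavy_share, avg_by_len, avg_overall

# Toy index: a short reading with many hits, two longer readings with few.
index = {
    "こう": ["entry%d" % i for i in range(40)],
    "こうこう": ["entry1", "entry2"],
    "はかば": ["entry3"],
}
heavy_share, avg_by_len, avg = reading_stats(index)
```

Consistent with the trend reported above, the short 2-character reading dominates the result counts while longer queries return only a handful of candidates.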
Furthermore, if a user is able to postulate several readings for a target string, s/he is more likely to obtain the translation with less effort by querying with the longest of those postulates.</Paragraph> </Section> <Section position="2" start_page="9" end_page="11" type="sub_section"> <SectionTitle> 4.2 Comparison with a conventional system </SectionTitle> <Paragraph position="0"> As the second part of the evaluation, we tested whether the set of candidates returned for a query over a wrong reading includes the desired entry.</Paragraph> <Paragraph position="1"> We ran the following experiment. As a data set we used a collection of 139 entries taken from a web site displaying real-world reading errors made by native speakers of Japanese.</Paragraph> <Paragraph position="2"> For each entry, we queried our system with the erroneous reading to see whether the intended entry was returned among the system output. To transform this collection of items into a form suitable for dictionary querying, we converted all readings into hiragana, sometimes removing context words in the process. Table 3 gives a comparison of the results returned in the simple (conventional dictionary look-up) and intelligent (our system) search modes. 62 entries, mostly proper names and 4-character proverbs, were not contained in the dictionary and have been excluded from the evaluation. (We have also implemented the proposed system with the ENAMDICT, a name dictionary in the EDICT distribution.) The erroneous readings of the 77 entries that were contained in the dictionary averaged 4.39 characters in length.</Paragraph> <Paragraph position="3"> From Table 3 we can see that our system is able to handle more than 3 times as many erroneous readings as the conventional system, representing an error rate reduction of 35.8%. 
However, the average number of results returned (5.42) and the mean rank of the desired entry (4.71, calculated only over successful queries) are still sufficiently small to make the system practically useful.</Paragraph> <Paragraph position="4"> That the conventional system covers any erroneous readings at all is due to the fact that those readings are appropriate in alternative contexts, and as such both readings appear in the dictionary.</Paragraph> <Paragraph position="5"> Whereas our system is generally able to return all reading variants for a given kanji string and therefore provide the full set of translations for the kanji string, conventional systems return only the translation for the given reading. That is, with our system the learner will be able to determine which of the readings is appropriate for the given context based on the translation, whereas with conventional systems they will be forced into attempting to contextualize a (potentially) wrong translation.</Paragraph> <Paragraph position="6"> Of the 42 entries that our system did not handle, the majority of misreadings were due to the use of incorrect character readings in compounds (17) and graphical similarity-induced errors (16). Another 5 errors were the result of substituting the reading of a semantically similar word, and the remaining 5 the result of interpreting words as personal names.</Paragraph> <Paragraph position="7"> Finally, for the same data set we compared the relative rank of the correct and erroneous readings to see which was scored higher by our grading procedure. Given that the data set is intended to exemplify cases where the expected reading is different from the actual reading, we would expect the erroneous readings to rank higher than the actual readings. An average of 76.7 readings was generated for the 34 entries.</Paragraph> <Paragraph position="8"> The average relative rank was 12.8 for erroneous readings and 19.6 for correct readings; since a lower value indicates a higher placement in the candidate list, erroneous readings were on average ranked higher than the correct readings, in line with our prediction above.</Paragraph> <Paragraph position="9"> Admittedly, this evaluation was over a data set of limited size, largely because of the difficulty in gaining access to naturally-occurring kanji-reading confusion data. The results are, however, promising. We note that the ENAMDICT-based implementation additionally allows for name searches on the same basic methodology; we feel this part of the system should prove useful even to native speakers of Japanese, who often experience problems reading uncommon personal or place names, but as of yet we have not evaluated it and will not discuss it in detail.</Paragraph> </Section> <Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> In order to emulate the limited cognitive abilities of a language learner, we have opted for a simplistic view of how individual kanji characters combine in compounds. In step 4 of preprocessing, we use a naive Bayes model to generate an overall probability for each reading, and in doing so assume that component readings are independent of each other, and that phonological and conjugational alternation in readings does not depend on lexical context. Clearly this is not the case. For example, kanji readings deriving from Chinese and native Japanese sources (on and kun readings, respectively) tend not to co-occur in compounds. 
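The independence assumption described above can be sketched as follows: the probability of a compound reading is taken to be the product of the per-character reading probabilities. The probability table and the per-character segmentation below are hypothetical stand-ins for illustration, not FOKS's actual preprocessing data, and the sketch deliberately omits the phonological and conjugational alternations that the full system models.

```python
from itertools import product

# Hypothetical per-character reading probabilities P(reading | kanji);
# real values would be estimated from dictionary data.
READINGS = {
    "墓": {"ぼ": 0.8, "はか": 0.2},
    "地": {"ち": 0.6, "じ": 0.4},
}

def compound_readings(kanji):
    """Enumerate candidate readings for a kanji string, scoring each as the
    product of its per-character reading probabilities (the naive
    independence assumption discussed in the text)."""
    per_char = [READINGS[k].items() for k in kanji]
    scored = {}
    for combo in product(*per_char):
        reading = "".join(r for r, _ in combo)
        p = 1.0
        for _, prob in combo:
            p *= prob  # independence: probabilities simply multiply
        scored[reading] = p
    # Highest-probability candidates first.
    return sorted(scored.items(), key=lambda kv: -kv[1])

cands = compound_readings("墓地")
```

Under these toy probabilities the four candidate readings sum to 1, with the most probable candidate ranked first; in the real system such scores drive the ordering of results returned to the user.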
Furthermore, phonological and conjugational alternations interact in subtle ways and are subject to a number of constraints (Vance, 1987).</Paragraph> <Paragraph position="1"> However, depending on his/her proficiency level, the learner may not be aware of these rules, and may thus try to derive compound readings in a more straightforward fashion, which is adequately modeled by our simplistic independence-based model. As can be seen from the preliminary experimentation, our model is effective in handling a large number of reading errors but can be improved further. We intend to incorporate further constraints into the generation process after observing the correlation between search inputs and selected dictionary entries.</Paragraph> <Paragraph position="2"> Furthermore, the current cognitive model does not include any notion of possible errors due to graphic or semantic similarity. However, as seen from our preliminary experiment, these error types are also common. For example, 墓地 bochi &quot;graveyard&quot; and 基地 kichi &quot;base&quot; are graphically very similar but read differently, and 物 mono &quot;thing&quot; and 事 koto &quot;thing&quot; are semantically similar but take different readings.</Paragraph> <Paragraph position="3"> This leads to the potential for cross-borrowing of errant readings between these kanji pairs.</Paragraph> <Paragraph position="4"> Finally, we are working under the assumption that the target string is contained in the original dictionary, and thus base all reading generation on existing entries, assuming that the user will only attempt to look up words we have knowledge of. 
We also provide no immediate solution for random reading errors, or for cases where the user has no intuition as to how to read the characters in the target string.</Paragraph> </Section> <Section position="4" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.4 Future work </SectionTitle> <Paragraph position="0"> So far we have conducted only limited tests of the correlation between the results returned and the target words. In order to truly evaluate the effectiveness of our system, we need to perform experiments with a larger data set, ideally drawn from actual user inputs (coupled with the desired dictionary entries). The reading generation and scoring procedure can then be tuned by adding and adjusting various weight parameters that modify the calculated probabilities, and thus the results displayed.</Paragraph> <Paragraph position="1"> Also, to obtain full coverage of predictable errors, we would like to expand our model to incorporate consideration of errors due to the graphic or semantic similarity of kanji.</Paragraph> </Section> </Section> </Paper>