File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/c02-1140_intro.xml
Size: 5,642 bytes
Last Modified: 2025-10-06 14:01:22
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1140"> <Title>Bringing the Dictionary to the User: the FOKS system</Title> <Section position="2" start_page="0" end_page="2" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Unknown words are a major bottleneck for learners of any language, due to the high overhead involved in looking them up in a dictionary. This is particularly true in non-alphabetic languages such as Japanese, as there is no easy way of looking up the component characters of new words. This research attempts to alleviate the dictionary look-up bottleneck by way of a comprehensive dictionary interface which allows Japanese learners to look up Japanese words in an efficient, robust manner. While the proposed method is directly transferable to other language pairs, for the purposes of this paper, we will focus exclusively on a Japanese-English dictionary interface.</Paragraph> <Paragraph position="1"> The Japanese writing system consists of the three orthographies of hiragana, katakana and kanji, which appear intermingled in modern-day texts (NLI, 1986). The hiragana and katakana syllabaries, collectively referred to as kana, are relatively small (46 characters each), and each character takes a unique and mutually exclusive reading which can easily be memorized. Thus they do not present a major difficulty for the learner. Kanji characters (ideograms), on the other hand, present a much bigger obstacle. The high number of these characters (1,945 prescribed by the government for daily use, and up to 3,000 appearing in newspapers and formal publications) in itself presents a challenge, but the matter is further complicated by the fact that each character can and often does take on several different and frequently unrelated readings. The kanji 発, for example, has readings including hatsu and ta(tsu), whereas 表 has readings including omote, hyou and arawa(reru). 
Based on simple combinatorics, therefore, the kanji compound 発表 happyou &quot;announcement&quot; can take at least 6 basic readings, and when one considers phonological and conjugational variation, this number becomes much greater. Learners presented with the string 発表 for the first time will, therefore, have a possibly large number of potential readings (conditioned on the number of component character readings they know) to choose from. The problem is further complicated by the occurrence of character combinations which do not take on compositional readings. For example, 風邪 kaze &quot;common cold&quot; is formed non-compositionally from 風 kaze/fuu &quot;wind&quot; and 邪 yokoshima/ja &quot;evil&quot;. With paper dictionaries, look-up typically occurs in two forms: (a) directly based on the reading of the entire word, or (b) indirectly via component kanji characters and an index of words involving those kanji. Clearly in the first case, the correct reading of the word must be known in order to look it up, which is often not the case. In the second case, the complicated radical and stroke count systems make the kanji look-up process cumbersome and time-consuming. With electronic dictionaries--both commercial and publicly available (e.g. EDICT (2000))--the options are expanded somewhat. In addition to reading- and kanji-based look-up, for electronic texts, simply copying and pasting the desired string into the dictionary look-up window gives us direct access to the word.</Paragraph> <Paragraph position="2"> Even here, however, life is complicated by Japanese being a non-segmenting language, which puts the onus on the user to identify word boundaries. Some systems (e.g. Reading Tutor (Kitamura and Kawamura, 2000) and Rikai) provide greater assistance by segmenting longer texts and outputting individual translations for each segment (word). 
If the target text is available only in hard copy, it is possible to use kana-kanji conversion to manually input component kanji, assuming that at least one reading or lexical instantiation of those kanji is known by the user. Essentially, this amounts to individually inputting the readings of words in which the desired kanji appear, and searching through the candidates returned by the kana-kanji conversion system. Again, this is complicated and time-inefficient, so the need for a more user-friendly dictionary look-up method remains.</Paragraph> <Paragraph position="3"> In this paper we describe the FOKS (Forgiving Online Kanji Search) system, which allows a learner to use his/her knowledge of kanji to the fullest extent in looking up unknown words according to their expected, but not necessarily correct, reading. Learners are exposed to certain kanji readings before others, and quickly develop a sense of the pervasiveness of different readings. We attempt to tap into this intuition in predicting how Japanese learners will read an arbitrary kanji string, based on the relative frequency of readings of the component kanji and the relative rates of application of phonological processes. An overall probability is obtained for each candidate reading using the naive Bayes model over these component probabilities. Below, we describe how this is intended to mimic the cognitive ability of a learner, how the system interacts with a user and how it benefits a user.</Paragraph> <Paragraph position="4"> The remainder of this paper is structured as follows. Section 2 describes the preprocessing steps of reading generation and ranking. Section 3 describes the actual system as is currently visible on the internet. Finally, Section 4 provides an analysis and evaluation of the system.</Paragraph> </Section> </Paper>