<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1140"> <Title>Bringing the Dictionary to the User: the FOKS system</Title> <Section position="3" start_page="2" end_page="4" type="metho"> <SectionTitle> 2 Data Preprocessing </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1 Problem domain </SectionTitle> <Paragraph position="0"> Our system is intended to handle strings both in the form in which they appear in texts (as a combination of the three Japanese orthographies) and as they are read (with the reading expressed in hiragana). Given a reading input, the system needs to establish a relationship between the reading and one or more dictionary entries, and rate the plausibility of each entry being realized with the entered reading.</Paragraph> <Paragraph position="1"> In a sense this problem is analogous to kana-kanji conversion (see, e.g., Ichimura et al. (2000) and Takahashi et al. (1996)), in that we seek to determine a ranked listing of kanji strings that could correspond to the input kana string. There is one major difference, however.</Paragraph> <Paragraph position="2"> Kana-kanji conversion systems are designed for native speakers of Japanese and as such expect accurate input. When the correct or standardized reading is not available, kanji characters have to be converted one by one. This can be a painstaking process because many characters share identical readings, resulting in long lists of characters for the user to choose from. Our system, on the other hand, does not assume 100% accurate knowledge of readings, but instead expects readings to be predictably derived from the source kanji. What we do assume is that the user is able to determine word boundaries, which is in reality a non-trivial task due to Japanese being non-segmenting (see Kurohashi et al. 
(1994) and Nagata (1994), among others, for details of automatic segmentation methods). In a sense, the problem of word segmentation is distinct from the dictionary look-up task, so we do not tackle it in this paper.</Paragraph> <Paragraph position="3"> To be able to infer how kanji characters can be read, we first determine all possible readings a kanji character can take based on automatically-derived alignment data. Then, we use machine learning to acquire the phonological rules governing the formation of compound kanji strings. Given this information we are able to generate a set of readings for each dictionary entry that might be perceived as correct by a learner possessing some, potentially partial, knowledge of the character readings. Our generative method is analogous to that successfully applied by Knight and Graehl (1998) to the related problem of Japanese (back) transliteration.</Paragraph> </Section> <Section position="2" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 2.2 Generating and grading readings </SectionTitle> <Paragraph position="0"> In order to generate a set of plausible readings we first extract all dictionary entries containing kanji, and for each entry perform the following steps: 1. Segment the kanji string into minimal morphophonemic units and align each resulting unit with the corresponding reading. For this purpose, we modified the TF-IDF based method proposed by Baldwin and Tanaka (2000) to accept bootstrap data.</Paragraph> <Paragraph position="1"> 2. Perform conjugational, phonological and morphological analysis of each segment-reading pair and standardize the reading to canonical form (see Baldwin et al. (2002) for full details). In particular, we consider gemination (onbin) and sequential voicing (rendaku) as the most commonly-occurring phonological alternations in kanji compound formation (Tsujimura, 1996). The canonical reading for a given segment is the basic reading to which conjugational and phonological processes apply.</Paragraph> <Paragraph position="2"> A unit is not limited to one character. For example, verbs and adjectives commonly have conjugating suffixes that are treated as part of the same segment.</Paragraph> <Paragraph position="3"> In the previous example of 発表 happyou &quot;announcement&quot;, the underlying readings of the individual characters are hatsu and hyou, respectively. When the compound is formed, the hatsu segment undergoes gemination and the hyou segment undergoes sequential voicing, resulting in the surface-form reading happyou.</Paragraph> <Paragraph position="3"> 3. Calculate the probability of a given segment being realized with each reading (P(r|k)), and of phonological ($P_{phon}(r)$) or conjugational ($P_{conj}(r)$) alternation occurring.</Paragraph> <Paragraph position="5"> The set of reading probabilities is specific to each (kanji) segment, whereas the phonological and conjugational probabilities are calculated based on the reading only. After obtaining the composite probabilities of all readings for a segment, we normalize them to sum to 1.</Paragraph> <Paragraph position="6"> 4. Create an exhaustive listing of reading candidates for each dictionary entry s and calculate the probability P(r|s) for each reading, based on evidence from step 3 and the naive Bayes model (assuming independence between all parameters).</Paragraph> <Paragraph position="8"/> </Section> </Section> <Section position="4" start_page="4" end_page="4" type="metho"> <SectionTitle> 5. Calculate the corpus-based frequency F(s) of </SectionTitle> <Paragraph position="0"> each dictionary entry s in the corpus and then the string probability P(s), according to equation (3). Notice that the term $\sum_{i} F(s_i)$ depends on the given corpus and is constant for all strings s.</Paragraph> <Paragraph position="2"> $P(s) = \frac{F(s)}{\sum_{i} F(s_i)}$ (3)</Paragraph> <Paragraph position="4"> 6. 
Use Bayes' rule to calculate the probability P(s|r) of each resulting reading according to equation (4): $P(s|r) = \frac{P(r|s) P(s)}{P(r)}$ (4)</Paragraph> <Paragraph position="6"> Here, as we are only interested in the relative score for each s given an input r, we can ignore P(r) and the constant $\sum_{i} F(s_i)$. The final plausibility grade is thus estimated as in equation (5).</Paragraph> <Paragraph position="8"> $Grade(s|r) = P(r|s) \times F(s)$ (5)</Paragraph> <Paragraph position="10"> The resulting readings and their scores are stored in the system database to be queried as necessary.</Paragraph> <Paragraph position="11"> Note that the above processing is fully automated, a valuable quality when dealing with a volatile dictionary such as EDICT.</Paragraph> <Paragraph position="12"/> </Section> <Section position="5" start_page="4" end_page="9" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> The section above described the preprocessing steps necessary for our system. In this section we describe the actual implementation.</Paragraph> <Section position="1" start_page="4" end_page="7" type="sub_section"> <SectionTitle> 3.1 System overview </SectionTitle> <Paragraph position="0"> The base dictionary for our system is the publicly-available EDICT Japanese-English electronic dictionary. We extracted all entries containing at least one kanji character and executed the steps described above for each. Corpus frequencies were calculated over the EDR Japanese corpus (EDR, 1995).</Paragraph> <Paragraph position="1"> During the generation step we ran into problems with extremely large numbers of generated readings, particularly for strings containing large numbers of kanji. Therefore, to reduce the size of the generated data, we only generated readings for entries with fewer than 5 kanji segments, and discarded any readings not satisfying $P(r|s) \geq 5 \times 10^{-5}$. 
Finally, to complete the set, we inserted correct readings for all dictionary entries $s_{kana}$ that did not contain any kanji characters (for which no readings were generated above), with plausibility grade calculated by equation (6): $Grade(s_{kana}|r) = F(s_{kana})$ (6). Here, $P(r|s_{kana})$ is assumed to be 1, as there is only one possible reading (i.e. r).</Paragraph> <Paragraph position="2"> The above set is stored in a MySQL relational database and queried through a CGI script. Since the readings and scores are precalculated, no readings need to be generated at query time. Figure 1 depicts the system output for the query atamajou. The system is easily accessible through any Japanese language-enabled web browser. Currently we include only a Japanese-English dictionary, but it would be a trivial task to add links to translations in alternative languages.</Paragraph> </Section> <Section position="2" start_page="7" end_page="8" type="sub_section"> <SectionTitle> 3.2 Search facility </SectionTitle> <Paragraph position="0"> The system supports two major search modes: simple and intelligent.</Paragraph> <Paragraph position="1"> Simple search emulates a conventional electronic dictionary search (see, e.g., Breen (2000)) over the original dictionary, taking both kanji and kana as query strings and displaying the resulting entries with their reading and translation. It also supports wildcard searches and searches restricted to a specified character length. These functions enable lookup of novel kanji combinations as long as at least one kanji is known and can be input into the dictionary interface.</Paragraph> <Paragraph position="2"> Intelligent search is over the set of generated readings. It accepts only kana query strings and proceeds in two steps. Initially, the user is provided with a list of candidates corresponding to the query, displayed in descending order of the score calculated from equation (5). The user must then click on the appropriate entry to get the full translation. 
This search mode is what separates our system from existing electronic dictionaries.</Paragraph> </Section> <Section position="3" start_page="8" end_page="9" type="sub_section"> <SectionTitle> 3.3 Example search </SectionTitle> <Paragraph position="0"> Let us explain the benefit of the system to the Japanese learner through an example. Suppose the user is interested in looking up 頭上 zujou &quot;overhead&quot; in the dictionary but does not know the correct reading. Both 頭 &quot;head&quot; and 上 &quot;over/above&quot; are quite common characters but are frequently realized with different readings, namely atama, tou, etc. and ue, jou, etc., respectively. As a result, the user could interpret the string 頭上 as being read as atamajou or toujou and query the system accordingly. Tables 1 and 2 show the results of these two queries.</Paragraph> <Paragraph position="1"> Note that the displayed readings are always the correct readings for the corresponding Japanese dictionary entry, and not the reading in the original query. For all those entries where the actual reading coincides with the user input, the reading is displayed in boldface.</Paragraph> <Paragraph position="2"> In order to retain the functionality offered by the simple interface, we automatically default all queries containing kanji characters and/or wildcard characters to simple search. Readings here are given in romanized form, whereas they appear only in kana in the actual system interface. See Figure 1 for an example of real-life system output.</Paragraph> <Paragraph position="3"> From Table 1 we see that only two results are returned for atamajou, and that the highest ranking candidate corresponds to the desired string 頭上. Note that atamajou is not a valid word in Japanese, and that a conventional dictionary search would yield no results.</Paragraph> <Paragraph position="4"> Things get somewhat more complicated for the reading toujou, as can be seen from Table 2. 
A total of 14 entries are returned, for four of which toujou is the correct reading (as indicated in bold). The string 頭上 is ranked second, scoring higher than three entries for which toujou is the correct reading, because the scoring procedure does not consider whether a generated reading is correct.</Paragraph> <Paragraph position="5"> For both of these inputs, a conventional system would not provide access to the desired translation without additional user effort, while the proposed system returns the desired entry as a first-pass candidate in both cases.</Paragraph> </Section> </Section> </Paper>