<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1097"> <Title>CRITAC- A JAPANESE TEXT PROOFREADING SYSTEM</Title> <Section position="4" start_page="412" end_page="412" type="metho"> <SectionTitle> SQL/DS Dictionary Server </SectionTitle> <Paragraph position="0"> Online access to systmn dictionaries or an encyclopedia \[WEYEB8501\] is one of the most user-fi'iendly facilities in an advanced text processing system. CRITAC is connected to a dictionary server implemented on SQL/DS. SQL/DS (Structured Query Language/Data System \[IBM8308\]) is a relational \[CODD7006\] database management system. It has been mainly used for business data processing purposes like purchase-order files. The excellent user language of SQL/DS is based on the relational calculus which can be easily incorporated into Prolog \[IBM8509\]. Furthermore, SQL/DS can support multiple access to tables. For example, a user can access tables in terms of KANt1 values or PRONUNCIA-TION values. This greatly enhances the &quot;associative memory&quot; access of online dictionaries. That is, we can retrieve all the related information by giving some known valties. null The Japanese word processors currently have limited the nmnber of keys on the keyboard by using the phonetic character set, Hiragana or Katakana, to enter text. Once the pronunciation of a word or a phrase is given, it will he converted to the most-likely Kanji expression. This process is a Kana-to-Kanji conversion. Because of the large number of homonyms \[TANA8310\] in the Japanese language, this conversion is liable to generate an unintended Kanji expression. This is referred to as misconversion.</Paragraph> <Paragraph position="1"> We include some &quot;canned&quot; queries to the dictionary. A typical access pattern is searching for homonyms as shown in the section &quot;A Sample CRITAC Session&quot; (Figure 10), which helps users correct misconversions.</Paragraph> <Paragraph position="2"> Canned queries also include synonyms, antonyms, related words, and upper/lower concept of the given words. The conceptual hierarchy of words is obtained from the following combined SQL queries.</Paragraph> </Section> <Section position="5" start_page="412" end_page="413" type="metho"> <SectionTitle> SELECT FROM WHERE X.CATEGO RY -- NUMBER </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> Get the stem number of the returned category - number.</Paragraph> <Paragraph position="3"> X.CATEGORY-NUMBER is like &quot;1.2.39&quot;. stemNum -- STEM(X.CATEGORY- NUMBER) ...................................................................... SELECT X.KANJI,X.YOMI, Y.CATEGORY-NUMBF, R FROM KANJI - PRIMITIVE - WORD - TABLE X, SEMANTIC-- CLASSIFICATION-- TABLE Y WHERE X.KANJI = Y.KANJI AND Y.CATEGRY - NUMBER lAKE 'stemNum%' ORDER BY CATEGORY2qUMBER The first query returns the category number, say &quot;1.2.39&quot;, from a table called &quot;SEMANTIC-CLASSIFICATION-TABLE&quot; We need a stem number (&quot;1&quot; in this case) of this category. The stem nmnber is then used to find the words whose category number has the same stem number. This is what the second query retrieves. The result is arranged by the category number. Note that users can make use of not only canned queries but also ad hoc queries. Since the SQL query language and the conceptual scheme of the dictionaries are easy to understand, users can make queries like those above to get ad hoc information. Such a set of customized queries is also stored in SQL/DS. 
<Section position="1" start_page="412" end_page="413" type="sub_section"> <SectionTitle> The Text Compiler </SectionTitle> <Paragraph position="0"> CRITAC also provides a &quot;text compiler&quot; facility, analogous to the compilers of programming languages. The user gives text to this text compiler, and the compiler can provide the following.</Paragraph> <Paragraph position="2"> A source list of the text and diagnostic messages with line reference numbers.</Paragraph> <Paragraph position="3"> A list of segments and Kanji primitive words with their occurrence counts and pronunciations. A cross-reference list of words and text (KWOC: Key Word Out of Context) and other useful statistics can also be obtained. Various formats of the text, including the KWIC format, are available. Additional information (word boundaries, pronunciation, etc.) may be added to the text. If an application uses a specific format, the text compiler can be made to generate application input in this format.</Paragraph> </Section> <Section position="2" start_page="413" end_page="413" type="sub_section"> <SectionTitle> Preprocessor </SectionTitle> <Paragraph position="0"> The preprocessor decomposes continuously typed Japanese text into a sequence of tokenized primitive words and particles. A basic unit of a Japanese sentence is a content word followed by zero or more function words; such a unit is called a segment or a phrase (see \[MIYAG8310\], for example, for more details). This preprocessing is required because Japanese text has no explicit delimiters (blanks) between words.</Paragraph> <Paragraph position="1"> The preprocessor also gathers information such as pronunciation (see footnote 2), part-of-speech, total number of occurrences in the text, and base form, associated with each of the words and particles. This information, together with the original text, is stored as a set of Prolog facts to be used for later processing. We illustrate the steps of the preprocess (see Figure 4). 1. Japanese text is a collection of sentences. Each sentence is just a continuous string of characters.</Paragraph> <Paragraph position="2"> 2. A segmentation algorithm is applied to each sentence. This algorithm contains about 100 heuristic rules, each of which specifies the cases where a segment boundary usually appears. The accuracy of this segmentation algorithm is about 97.5%.</Paragraph> <Paragraph position="3"> 3. Content words in the segments are recognized by looking them up in a primitive word dictionary. If a content word is a compound word, it is decomposed into primitive words. Since many Kanji compound words have ambiguous decompositions, we apply a stochastic estimation algorithm and a Kanji primitive word dictionary with statistics \[FUJI8509\] to find the most likely decomposition. Our algorithm is obtained from the stochastic estimation algorithms in \[FORN7303\], \[BAHLJ8303\], and \[FUJI8407\] with slight modifications. The accuracy of our algorithm is about 96.5%.</Paragraph> <Paragraph position="4"> 4. Function words in each segment are identified. The connectivity of these function words is described by an automaton \[OKOC8112\]. A correct sequence of function words is obtained by observing the transitions over this automaton (a sketch follows this section).</Paragraph> <Paragraph position="5"> (Figure 4: an example of a segmented sentence with case and verb-conjugation particles marked; w: primitive word, p: prefix, s: suffix.)</Paragraph> </Section> </Section>
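<Paragraph> As mentioned in step 4 above, a correct function-word sequence corresponds to a path over a connectivity automaton. The following is a minimal Prolog sketch of that check; the states, function-word labels, and predicate names (trans/3, final/1, accepts/1) are invented for illustration, since the actual automaton over Japanese function words is the one described in \[OKOC8112\].

    % Hypothetical transition table: trans(State, FunctionWord, NextState).
    trans(start,      case_particle,  after_case).
    trans(after_case, topic_particle, after_topic).
    trans(start,      verb_ending,    after_verb).
    trans(after_verb, polite_suffix,  after_verb).

    % States in which a segment may legally end.
    final(after_case).
    final(after_topic).
    final(after_verb).

    % accepts(+FunctionWords): the sequence is a legal path over the automaton.
    accepts(Words) :-
        walk(start, Words).

    walk(State, [])     :- final(State).
    walk(State, [W|Ws]) :- trans(State, W, Next), walk(Next, Ws).

For example, accepts([case_particle, topic_particle]) succeeds while accepts([topic_particle]) fails, which is the kind of test used to select a correct sequence of function words.</Paragraph>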
<Section position="6" start_page="413" end_page="414" type="metho"> <SectionTitle> Footnote 2 </SectionTitle> <Paragraph position="0"> Roughly speaking, Katakana and Hiragana characters are phonetic, and each of them corresponds to one phoneme. Kanji characters are ideographic, and there is a many-to-many mapping between the Kanji character set and a set of phoneme sequences.</Paragraph> <Paragraph position="1"> The preprocess starts with the given text at step 1 above. The text is gradually analyzed and decomposed into fragments in the succeeding three steps. Details of the algorithms used here are beyond the scope of this paper.</Paragraph> <Paragraph position="2"> The above high-level objects (segments and words) of Japanese text are conceptually expressed in terms of four types of facts and three types of predicates, as shown in Figure 5. We map < segment >, < content word >, < function word >, and punctuation into the facts seg(), head(), tail(), and punc(). Other fragments of text can be defined from those basic facts. This is called structured text.</Paragraph> <Paragraph position="3"> seg(I,J,K,X): a character string X is the K-th segment in the J-th sentence of the I-th paragraph.</Paragraph> <Paragraph position="4"> I, J and K denote the same indexes below.</Paragraph> <Paragraph position="5"> head(I,J,K,U,Y,G,L): U is a content word (possibly a Kanji compound word) of the segment X above, with pronunciation Y and part-of-speech G. L is a list of labels to denote the prefixes, primitive words and suffixes in U if U is a compound word. These four predicates are facts that represent the basic objects in Japanese text. The rest of the predicates represent derived fragments of the text.</Paragraph> <Paragraph position="6"> The structured text enables us to generate the two external views of the previous subsection (Figure 6). The mapping between the structured text and the source view is straightforward. The i-th row of the KWIC view is a sentence which has w_i segments. Seg_{i,j} is a shorthand notation for a segment X that appears in the i-th row of the KWIC view and satisfies seg(p_i, s_i, j, X) for some p_i and s_i. Each sentence of the text appears in the KWIC view as many times as the number of its segments, because each segment has one keyword that appears in a distinct row of the view. For example, the first sentence appears three times in the KWIC view of Figure 6. Key_x equals the content word of Seg_{x,k_x}, and T_x is the remaining function words of Seg_{x,k_x}. If Key_x is a compound word consisting of y primitive words p_1, p_2, ..., p_y, they are shown in y separate rows instead of a single row with Key_x.</Paragraph> <Paragraph position="8"> (Figure 5: sample structured-text facts seg(...), head(1,2,6,...), and tail(1,2,6,...) for one segment; the Japanese character strings are not reproduced here.) Proofreading rules are written in terms of predicates, including those used for the structured text, and built-in predicates \[IBM8509\]. There are two types of rules: source rules and KWIC rules. By source rules we mean those rules which involve qualification over segments, content words, and function words in the source view of the text. The KWIC rules involve the qualification of adjacent keywords as well as their ordering.</Paragraph>
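<Paragraph> To make the structured text concrete, the following is a minimal Prolog sketch of a few facts and one derived fragment. The romanized strings and the derived predicate noun_segment/4 are invented for illustration; in CRITAC the arguments hold the actual Japanese character strings and the derived predicates are those of Figure 5.

    % seg(I,J,K,X): X is the K-th segment of the J-th sentence of paragraph I.
    seg(1, 2, 6, 'kekka wo').
    seg(1, 2, 7, 'shiraberu').

    % head(I,J,K,U,Y,G,L): content word U of that segment, with pronunciation
    % Y, part-of-speech G, and constituent labels L (w: primitive word).
    head(1, 2, 6, kekka,     kekka,     noun, [w]).
    head(1, 2, 7, shiraberu, shiraberu, verb, [w]).

    % A derived fragment of the text: the K-th segment of sentence J in
    % paragraph I whose content word is a noun.
    noun_segment(I, J, K, X) :-
        seg(I, J, K, X),
        head(I, J, K, _U, _Y, noun, _L).

A query such as ?- noun_segment(1, 2, K, X). enumerates the noun segments of that sentence; the source and KWIC rules below qualify over the structured text in the same manner.</Paragraph>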
<Section position="1" start_page="414" end_page="414" type="sub_section"> <SectionTitle> Source Rules </SectionTitle> <Paragraph position="0"> One category of source rules aims at finding excessively complicated or ambiguous noun phrases. This is effective in the Japanese language environment because Japanese grammar allows nouns and some particles to be strung together into a long sequence, and such sequences are used frequently. Basically, we have two types of phrases to detect: one is repeated noun modifiers of the same kind, and the other is an ambiguous dependency among segments. For example, phrases like</Paragraph> </Section> <Section position="2" start_page="414" end_page="414" type="sub_section"> <SectionTitle> My Mother's Father's Company's Location is... </SectionTitle> <Paragraph position="0"> (repetition of possessive noun phrases) and &quot;Applying equations A or B and C&quot; (too ambiguous modification) are detected.</Paragraph> <Paragraph position="1"> Another category of source rules aims at detecting typographical errors and inconsistent use of words. Some sample proofreading rules follow. A rule for detecting an incorrect ending of sentences:</Paragraph> <Paragraph position="3"> This rule scans the function words of the text. If a segment ends with a function word (the last element of the list T in &quot;tail&quot;) which usually implies the end of a sentence, but is not followed by a punctuation mark (period) E, then give a warning to the author.</Paragraph> <Paragraph position="4"> A rule for detecting an inconsistent use of numeric prefixes, where one is an alphabetical (Roman) numeric prefix and the other is a Chinese (Kanji) numeric prefix:</Paragraph> <Paragraph position="6"> This rule detects the inconsistent usage of Kanji and Roman numeric prefixes for some content word. If some primitive word W1 is preceded by numeric prefixes P1 and P2 denoting the same numbers but not of the same character type, then give a warning to the author. &quot;Number-prefix(K,L,P,W)&quot; succeeds if there is a number prefix P preceding a primitive word W in the compound word K, where the constituent types are listed in L. &quot;Char-type(P,T)&quot; succeeds if a 2-byte character string P consists of only one character type (&quot;Kanji&quot;, &quot;Hiragana&quot;, &quot;Katakana&quot;, or &quot;Roman&quot;) or if P is &quot;mixed&quot;.</Paragraph> </Section> </Section> <Section position="7" start_page="414" end_page="415" type="metho"> <SectionTitle> KWIC Rules </SectionTitle> <Paragraph position="0"> KWIC rules, in contrast with the source rules, refer to keywords of the KWIC view rather than the &quot;segment&quot; or &quot;head&quot; facts above.</Paragraph> <Paragraph position="1"> In the KWIC view, as explained in the previous subsection, if Key_1, ..., Key_k is the phonetic ordering, homonyms are arranged adjacently. This greatly reduces the time to detect homonym errors (conversion errors), because the system only has to scan the keywords Key_i once, examining the pronunciations of Key_i and Key_{i+1} and possibly some other local conditions. For example, we can express possible conversion errors as follows.</Paragraph> <Paragraph position="3"> If there are distinct keywords U1 and U2 with the same pronunciation Y1 in the same context P (preceding primitive word) and S (succeeding primitive word), then one of them is possibly a misconversion. &quot;Key(I,U,Y,G)&quot; means Key_I = U with pronunciation Y and lexical category G.</Paragraph> <Paragraph position="4"> &quot;Next-key(I,U,Y,G)&quot; unifies with key(I1,U,Y,G) such that I1 = I+1.</Paragraph>
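<Paragraph> The Prolog text of this rule is not reproduced above, so the following is only a minimal sketch of how such a KWIC rule might look. The underscored predicate names (key/4, next_key/4, warning_kwic/2) and the romanized sample facts are invented stand-ins, and the sketch omits the check that the preceding and succeeding primitive words P and S match.

    % Two adjacent KWIC keywords with the same pronunciation but different
    % spellings: one of them is possibly a misconversion.
    key(1, kouen_a, kouen, noun).   % two distinct spellings sharing
    key(2, kouen_b, kouen, noun).   % the pronunciation "kouen"
    key(3, hashi_a, hashi, noun).

    next_key(I, U, Y, G) :-         % the keyword of the next KWIC row
        I1 is I + 1,
        key(I1, U, Y, G).

    possible_misconversion(I) :-    % flag row I (or I+1) for the author
        key(I, U1, Y, _G1),
        next_key(I, U2, Y, _G2),
        U1 \== U2,
        warning_kwic(misconversion, I).

    warning_kwic(Kind, Row) :-
        format('row ~w: possible ~w~n', [Row, Kind]).

Calling possible_misconversion(I) while scanning the rows then flags adjacent keywords that share a pronunciation but differ in spelling, which is the behaviour described in the rule above.</Paragraph>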
<Paragraph position="5"> If Key_1, ..., Key_k are ordered either by their preceding or by their succeeding word, we can detect some lack of conformity in word usage. For example, if keywords are ordered by their succeeding words, the following case will be detected.</Paragraph> <Paragraph position="6"> ... be a spelling error in such cases ...</Paragraph> <Paragraph position="7"> ... can find style errors as well as ...</Paragraph> <Paragraph position="8"> ... to detect stylistic errors in the text ...</Paragraph> <Paragraph position="10"> & warningKWIC('inconsistency',I1).</Paragraph> <Paragraph position="11"> Here we assume both stem('style','style') and stem('stylistic','style') are successful.</Paragraph> </Section> <Section position="8" start_page="415" end_page="416" type="metho"> <SectionTitle> 3. A Sample CRITAC Session </SectionTitle> <Paragraph position="0"> This section illustrates a typical CRITAC session consisting of four actions: 1. Detect an error in the source view (Figure 7). 2. Display the explanation of the error (Figure 8). 3. Locate the error candidate in the KWIC view (Figure 9). 4. Get the homonyms of a keyword by invoking the dictionary server (Figure 10).</Paragraph> <Paragraph position="1"> These actions are implemented as basic functions of the system. Locating a word of interest in different views efficiently helps users judge whether the system-detected errors are real errors or not.</Paragraph> <Paragraph position="2"> Figure 8 explains that error code 36 means that the underlined segment has a Kanji compound word with a very rare combination of primitive words.</Paragraph> </Section> <Section position="9" start_page="416" end_page="416" type="metho"> <SectionTitle> CRITAC KWIC A1 F 116 TRUNC=116 SIZE=1207 LINE=970 COL=1 ALT=0 </SectionTitle> <Paragraph position="0"> The KWIC view (Figure 9) shows the number of occurrences of a primitive word. The user wonders whether that primitive word caused the previous error 36 in the Kanji compound word.</Paragraph> <Paragraph position="1"> Giving the primitive word as a key, the dictionary server displays a list of its homonyms (Figure 10). The user might then find the correct primitive word in the list.</Paragraph> </Section> </Paper>