<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1038"> <Title>LINGUISTIC ERROR CORRECTION OF JAPANESE SENTENCES</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Restrictions </SectionTitle> <Paragraph position="0"> This system imposes two restrictions on input data. The first is that the input must consist of grammatical Japanese sentences, so that syntax analysis can be applied. The system is not effective for texts that consist only of numerical data or mere lists of words.</Paragraph> <Paragraph position="1"> The second restriction is that the texts to be dealt with must be limited to a special field. This restriction limits the number of technical terms that are used in the field.</Paragraph> <Paragraph position="2"> The corpus used for the experiment consists of 1,700 claims from patent gazettes of the Japan Patent Office. These gazettes concern the manufacturing technology of LSI devices over thirteen years (1964-1976). Figure 1 shows an example. The corpus contains 306,000 words, of which about five thousand are different words. Their distribution over the categories is as follows: noun, 3,603; functional word, 90; verb, 832; suffix and prefix, 110; adjective, 75; adverb, 61; conjunction, 30. As twenty thousand common words are added to these words, the dictionary contains about twenty-five thousand words.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Features of Patent Sentences </SectionTitle> <Paragraph position="0"> Written Japanese consists of kanji, kana (hiragana and katakana) and alphanumeric letters. Kana are phonetic symbols and kanji are ideographs. 
The kana set (either hiragana or katakana) consists of 48 letters.</Paragraph> <Paragraph position="1"> More than 2,000 different kanji are in daily use.</Paragraph> <Paragraph position="2"> Japanese sentences are written as one</Paragraph> <Paragraph position="4"> continuous string of letters with no spaces (see Fig. 1). In this respect Japanese differs from Western languages. To analyze a Japanese sentence, it is first necessary to identify the words in the continuous string of letters.</Paragraph> <Paragraph position="5"> Figure 2 shows the construction of a pause group, which is the minimum meaningful unit of a Japanese sentence. The prefix, the independent word and the suffix are usually written in kanji or katakana. The dependent word is written in kana. Changes of letter type, as well as punctuation symbols, give useful clues to the boundaries where a long letter string can be separated into shorter, manageable units (pause groups). The correction system detects words by using such boundary conditions (Fig. 2).</Paragraph> <Paragraph position="6"> Experiments were conducted on the claim sentences of the patents. A claim sentence has a particular style. Most claims consist of a single sentence. An analytical study [3] of the claim sentences showed that all sentences could be classified into 14 sentence patterns according to their coordinate phrases. The average sentence contains 180 words. Each sentence is one large noun phrase built from many coordinate adjectival or adverbial phrases that modify the same word.</Paragraph> <Paragraph position="7"> The claim sentence is so long that it is practical to analyze it on the basis of these phrases.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. 
Kanji OCR and its Errors </SectionTitle> <Paragraph position="0"> The large number of character categories and the structural complexity of each character pattern are the dominant difficulties in kanji character recognition. A two-stage recognition method [4] has been developed to cope with these difficulties. This method applies an efficient candidate selection before a precise individual recognition. Fig. 3 shows a diagram of the two-stage recognition method.</Paragraph> <Paragraph position="1"> In the first stage, features are extracted from the input pattern, and candidate characters are obtained according to their geometrical features. In the second stage, the input pattern is matched against the reference pattern of each selected candidate character. The decision is made on the basis of the similarity values.</Paragraph> <Paragraph position="2"> Mutually similar patterns and low print quality cause recognition errors and rejections. Such illegible letters have low similarity values. The recognition speed of this kanji OCR is 100 characters per second.</Paragraph> <Paragraph position="3"> A correct recognition rate of more than 99 percent was obtained for actual data. The average claim sentence contains 450 letters. Consequently, at 100 characters per second and an error rate of just under 1 percent, the system encounters roughly one illegible letter every second, and three or four illegible letters in each claim sentence.</Paragraph> <Paragraph position="4"> As illegible letters have low similarity values, the correction system can find doubtful letters easily. Checking every letter in a text would take too much processing time. Instead, the system picks out only the phrases that contain illegible letters and analyzes their grammaticality. 
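The selection step just described can be illustrated with a minimal sketch. The data layout (a phrase as a list of letter and similarity pairs), the threshold value 0.8 and all function names are assumptions made for illustration only; the paper does not specify them.

```python
# Hypothetical sketch: the threshold and the data layout are assumptions,
# not taken from the paper.
from typing import List, Tuple

Letter = Tuple[str, float]   # (recognized letter, OCR similarity value)
Phrase = List[Letter]

def is_illegible(letter: Letter, threshold: float) -> bool:
    """Treat a letter as doubtful when its similarity is under the threshold."""
    return threshold > letter[1]

def phrases_to_check(phrases: List[Phrase], threshold: float = 0.8) -> List[int]:
    """Return the indices of the phrases that contain a doubtful letter."""
    return [i for i, phrase in enumerate(phrases)
            if any(is_illegible(letter, threshold) for letter in phrase)]

phrases = [
    [("K", 0.95), ("I", 0.97)],
    [("B", 0.55), ("A", 0.92)],   # contains a low-similarity letter
    [("N", 0.99)],
]
print(phrases_to_check(phrases))   # only the second phrase (index 1) is selected
```

Only the selected phrases would then be passed on to the grammatical analysis, which is where the saving in processing time comes from.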
This restriction reduces the processing time and makes the system practical.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Error Correction Method </SectionTitle> <Paragraph position="0"> The error correction system has three analysis functions (Fig. 4).</Paragraph> <Paragraph position="1"> a) word analysis function, b) syntax analysis function, c) wording analysis function. Two notations are used here. When a letter cannot be recognized uniquely, the candidates for the letter are enclosed in parentheses. A letter that cannot be recognized at all is expressed by a question mark.</Paragraph> <Paragraph position="3"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Word Analysis Function </SectionTitle> <Paragraph position="0"> When it encounters an ambiguous letter in a sentence, the word analysis program searches the dictionary to find a grammatically and semantically valid candidate.</Paragraph> <Paragraph position="1"> For example, '[PA BA]TANNINSIKI' shows that the first letter is ambiguous. In this case, two candidates are tested. The candidate 'PATANNINSIKI (pattern recognition)' is meaningful, but 'BATANNINSIKI' is not. So 'PA' is determined as the unique answer.</Paragraph> <Paragraph position="2"> Some Japanese letters closely resemble one another.</Paragraph> <Paragraph position="4"> The choice among such similar patterns depends on their context. For instance, '- (a long vowel)' is the reasonable choice for the ambiguous letter in 'BEESU (base)'.</Paragraph> <Paragraph position="5"> In Japanese sentences, two or more words are frequently connected without any conjunction or preposition. In this case the word analysis program calls the compound word analysis subprogram, which looks up the word dictionary and tries to build the string from two or three words. The above example is a compound word. 
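The candidate test and the compound word lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the romanized two-word dictionary, the function names and the recursive splitting strategy are hypothetical and are not the paper's implementation.

```python
# Hypothetical sketch: the toy romanized dictionary and all names here are
# illustrative assumptions, not the paper's implementation.
from itertools import product
from typing import List, Optional, Sequence, Set

DICTIONARY: Set[str] = {"PATAN", "NINSIKI"}   # pattern, recognition

def split_compound(s: str, words: Set[str], max_parts: int = 3) -> Optional[List[str]]:
    """Split s into at most max_parts dictionary words, or return None."""
    if s in words:
        return [s]
    if max_parts == 1:
        return None
    for i in range(1, len(s)):
        if s[:i] in words:
            rest = split_compound(s[i:], words, max_parts - 1)
            if rest is not None:
                return [s[:i]] + rest
    return None

def resolve(slots: Sequence[Sequence[str]], words: Set[str]) -> List[List[str]]:
    """Try every combination of candidate letters and keep the readings
    that the dictionary can explain."""
    readings = []
    for combo in product(*slots):
        parts = split_compound("".join(combo), words)
        if parts is not None:
            readings.append(parts)
    return readings

# [PA BA]TANNINSIKI: the first slot has two candidates, the rest is fixed.
print(resolve([["PA", "BA"], ["TANNINSIKI"]], DICTIONARY))
```

With this toy dictionary only the reading that starts with PA survives, split into the two words for pattern and recognition.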
'PATANNINSIKI (pattern recognition)' is a compound word constructed from 'PATAN (pattern)' and 'NINSIKI (recognition)'.</Paragraph> <Paragraph position="6"> This subprogram can match not only full strings but also substrings (Fig. 5).</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> KEIRYO GENGO GAKU </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> When a letter has no candidate letter, the word analysis program consults the dictionary and searches for words that could fill in the illegible letter.</Paragraph> <Paragraph position="3"> In this case, seven letters can fill in the '?'. The selection among these candidates is made in the next step, syntax analysis. The word analysis program does not work for consecutive illegible letters. As most Japanese words consist of one or two kanji, consecutive illegible letters give no clue for searching the dictionary. Given the consecutive illegible letters '??', we can hardly guess what they are.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Syntax Analysis Function </SectionTitle> <Paragraph position="0"> When word analysis cannot resolve the ambiguities, syntax analysis is applied. In Example 2, seven candidates selected by the word analysis function still remain. The syntax analysis program refers to contextual information. Example 3. 9 ~'~ (transparent terminal). This program first conducts a morphological analysis of the given pause group, and then analyzes the syntactic role of each pause group in its phrase. A noun or verb does not take the ending 'NA'; only an adjectival verb can. So ' ~H~ ' is selected uniquely.</Paragraph> <Paragraph position="1"> Example 4. 
KIBAN(GA KA)ARU: There is a base.</Paragraph> <Paragraph position="2"> In this case 'KIBAN (base)' is a noun, 'ARU (be)' is a verb, 'GA' is a particle that indicates the subject, and 'KA' is a particle that indicates an alternative or a question. Only the particle 'GA' makes the sentence grammatical, so 'KIBAN GA ARU' is the unique answer. The syntax analysis program performs morphological analysis on the segmented pause groups (Fig. 2). If the segmentation is incorrect, the program cannot analyze the phrase or sentence, so it retries the segmentation of the input string to obtain a successful analysis.</Paragraph> <Paragraph position="4"> This example shows the retry process. The segmentation program first segments the string at the points where the letter type changes (b). This segmentation is not correct: the first pause group is not grammatical. The program then assumes that this pause group may be a compound pause group, and searches all possible separations from left to right. It finds an embedded adverb 'SOREZORE (each)', and with this segmentation the sentence can be analyzed successfully. The other candidate ' ~ ' cannot yield a grammatical sentence.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Wording Analysis Function </SectionTitle> <Paragraph position="0"> In the sentences of patent gazettes, important words or key words are used repeatedly with anaphoric pronouns. This fact is a very important clue for finding an anaphor or guessing an ambiguous letter. The arrows in Fig. 6 show the anaphoric relations of words in a text. Certain particular anaphoric pronouns appear</Paragraph> <Paragraph position="1"> in patent texts (' ..J.~ (above-mentioned)', 'GAI (such)', 'DOU (same)' and 'KONO (this)'). 
When an illegible letter occurs in an anaphoric word, the wording analysis program searches for the word it refers to and corrects the illegible letter with the matching letter. In Fig. 6, ' ~.~:l~,9~ (above-mentioned connected area)' has an anaphoric pronoun ' ~,~i '. ' ~'~ ' is compared with the referenced word ' ~ ', and the '?' is corrected to ' ~j~ '. The wording analysis program automatically prints out a glossary of the texts. This glossary is used to augment the dictionary of the error correction system. Numeric expressions are also used frequently. They are analyzed by using the semantic relations of the words in their vicinity. Because the bibliographic data of a patent contain the names of persons, places and affiliated organizations, the correction system needs to switch from the common dictionary to a proper noun dictionary. In a proper noun pause group, it is more important to analyze the semantic relations among the words [5]. Example 6. (KAWASAKI city ~l~#~lJl,~ , ?~i KANAGAWA prefecture)</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> KANAGAWAKEN KAWASAKI SHI </SectionTitle> <Paragraph position="0"> PS /~ ~J (name of city) ~r~ I% (person's name) This phrase describes an address, and 'SHI (city)' is a suffix for place names that does not attach to a person's name. So the illegible letter can be decided uniquely.</Paragraph> <Paragraph position="1"> 6. System Configuration and Experimental Results error correction system. Fig. 8 shows the configuration of this system. The error correction system is implemented on a minicomputer (TOSBAC40). The text editing terminal is a newly developed Japanese word processor. 
The operator of this system can check the error correction results on the CRT display, change the form of the text with versatile editing functions, and store the texts and transfer them to the host machine.</Paragraph> <Paragraph position="2"> The experimental results for 250 actual patent texts were as follows: effective correction, 53.8 percent; ineffective correction, 38.5 percent; wrong correction, 7.7 percent. The ineffective correction rate is the percentage of letters that the system could not correct.</Paragraph> <Paragraph position="3"> This example shows a case of wrong correction. The first letter was illegible, and the next letter was misread as 'KEN'; the correct letter is 'KYO'. The kanji OCR had made an error. The error correction system tried to correct the '?' by using the wrong letter as the clue, and so made a wrong correction.</Paragraph> </Section> </Paper>