<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2135"> <Title>A Computer Readability Formula of Japanese Texts for Machine Scoring</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> This study aims to obtain a readability formula that can be used by computer programs for style checking of Japanese texts. A readability formula predicts the difficulty of a document that results from its writing style, not from its content, organization, or format. A readability index is calculated from measures of surface characteristics of the document that are thought to indicate stylistic difficulty, without any attempt to parse sentences or to consult a large dictionary.</Paragraph> <Paragraph position="1"> Many of the readability formulae for English (for example,</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Flesch's Reading Ease Score /Flesch 1949/ and Automated Readability </SectionTitle> <Paragraph position="0"> Index /Smith 1970/) use the average length of words (number of syllables or letters) and the average number of words per sentence in a document to calculate the readability index. Word length is a measure of lexical difficulty, i.e., the difficulty of the vocabulary used in the document. Sentence length is a measure of syntactic difficulty, i.e., the complexity of the sentence. Although readability indices are derived from simple formulae, they predict the difficulty of a document reasonably well. 
This is because sentence length and word length are highly correlated with the complexity of the sentence and the difficulty of the word, respectively.</Paragraph> <Paragraph position="1"> Existing scoring methods for Japanese, such as those proposed by /Morioka 1958/ and /Yasumoto 1983/, use sentence length measured in letters instead of words, together with the percentage of kanzi (Chinese characters), the latter serving to estimate the difficulty of the vocabulary. Both rate the average number of letters per sentence and the percentage of kanzi in the text independently and do not combine the two factors into a single index. A text with longer sentences is rated as difficult, and a text with more kanzi is also rated as difficult. Morioka, who surveyed school textbooks, showed that upper-grade textbooks contain longer sentences on average and more kanzi.</Paragraph> <Paragraph position="2"> Yasumoto states that documents with more kanzi are less readable even for adults, for the following reason. Kanzi are logograms, each roughly corresponding to a word. Documents using more kanzi are therefore apt to include more different words and should demand more reading skill.</Paragraph> <Paragraph position="3"> A problem with rating sentence length and the percentage of kanzi independently is that the two may yield inconsistent ratings. Generally, a sentence becomes longer if its kanzi are rewritten in kana, so the same sentence then rates as harder on length but easier on the percentage of kanzi. Thus sentence length depends on the representation.</Paragraph> <Paragraph position="4"> There seems to have been no attempt at combining the factors of sentence length and the proportion of kanzi. On the other hand, no rationale is given for the separate measurements. 
It is possible to derive a single index that can assess the readability of Japanese text.</Paragraph> <Paragraph position="5"> /Sakamoto 1967/ proposed a method of scoring the relative difficulty of children's books to match the reading skill of the intended readers. His method consists of three independent ratings: (1) the proportion of fundamental words based on /Sakamoto 1958/, (2) the proportion of sentences that consist of more than 10 words, and (3) the proportion of kanzi. However, Sakamoto's method introduces the problem of measuring sentence length in words in place of the conflict between sentence length and representation.</Paragraph> <Paragraph position="6"> Using word count or word length as an estimator of readability is not practical in the case of Japanese. Since Japanese does not use word segmentation in normal writing, dividing sentences into words requires parsing and consulting a dictionary. Thus, a scoring method based on words, such as Sakamoto's, is costly. This is especially so when scoring is done by a computer, because extra devices such as parsers, a large dictionary, and sometimes semantic analyzers are required for word segmentation alone.</Paragraph> <Paragraph position="7"> Another problem with the traditional scoring methods is that they have ignored katakana, which are used to represent foreign words. Recent documents, especially scientific and technical ones, use many foreign words. /Watanabe 1983/ reports that, in a year's issues of the Journal of Information Processing Society of Japan, Vol. 17, about an eighth of the characters used are katakana. /Satake 1982/ surveyed articles in magazines published today and found that the ratio of katakana ranged from 4.44 to 13.75 percent. Thus the percentage of katakana is not negligible in scoring today's documents. 
Katakana words are imported foreign words, old and new, which are often unfamiliar to readers.</Paragraph> <Paragraph position="8"> Yet existing measures take only kanzi into account and are insufficient for scoring today's technically oriented documents.</Paragraph> </Section> </Section> </Paper>