<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2135"> <Title>A Computer Readability Formula of Japanese Texts for Machine Scoring</Title> <Section position="3" start_page="0" end_page="650" type="metho"> <SectionTitle> 2. Factors of Readability </SectionTitle> <Paragraph position="0"> We have chosen the following four surface characteristics as factors of readability: (1) the relative frequency of each type of character, (2) the length of a run (a maximal string that consists of one type of character),</Paragraph> <Paragraph position="1"> (3) the length of a sentence, and (4) the number of tootens (commas) per sentence. The former two are related to the difficulty of vocabulary in a document; the latter two are related to the complexity of sentences in a document.</Paragraph> <Section position="1" start_page="649" end_page="649" type="sub_section"> <SectionTitle> Character Frequencies </SectionTitle> <Paragraph position="0"> The most common Japanese writing system is based on a mixture of kanzis, kanas (hiraganas and katakanas), the Roman alphabet, Arabic numerals, and some other alphabets and symbols. Almost all normal writing is a mixture of kanzis, hiraganas, and katakanas (and others). Frequencies of types of characters in a Japanese text are known to affect its readability at least in the following manner: Kanzi, as mentioned before, are considered to make texts difficult. Since katakana and alphabets are used for foreign words, high frequencies of these characters indicate that the text contains many unfamiliar words. Hiragana are used to represent the rest of the text, and more of them are considered to make texts easier. There is no rigid orthography for Japanese.</Paragraph> <Paragraph position="1"> Nevertheless, the way an adult Japanese spells out a sentence in usual writing is roughly fixed. Kanzis are used for nouns and for the root parts of verbs, adjectives, adverbs, and the like. 
Hiraganas are used to write inflections and other grammatical parts of sentences, and katakanas are used mainly for the transcription of foreign words. So in passages written in the common way, the use of types of characters, i.e., kanzi, hiragana, katakana, etc., reflects the use of vocabulary and can be an indicator of the difficulty of the passage.</Paragraph> <Paragraph position="2"> It is possible to write the words usually written in kanzi in hiragana. However, psychological experiments such as the ones conducted by /Kitao 1960/ or /Hirose 1983/ show that a reader finds it difficult to read texts represented in a way unfamiliar to the reader. In Kitao's experiment, subjects took less time to read and recognize a word or a sentence written in the common way than one written solely in hiragana. In Hirose's experiment, the words usually written in kanzi were harder to recognize than the words usually written in kana when both types of words were written in kana. Both results show that words or sentences in the representation more familiar to a reader are more readable than those in a less familiar representation.</Paragraph> </Section> <Section position="2" start_page="649" end_page="650" type="sub_section"> <SectionTitle> Runs </SectionTitle> <Paragraph position="0"> In the ordinary representation, a boundary between types of characters corresponds to the boundary of words or smaller grammatical parts thereof. That is, a series of letters of the same type in the text, bounded by other character types, corresponds to a word or a smaller grammatical part. We will call such a series a run, i.e., a run is a maximal string that consists of only one type of character.</Paragraph> <Paragraph position="1"> It is not a grammatical unit. Usually, a run corresponds to one or more words. 
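The definition of a run can be made concrete with a short sketch. The Python below is an illustration only, not the authors' program: it assumes that character types can be told apart by the prefix of each character's Unicode name, and the names char_type and runs are hypothetical.

```python
import unicodedata
from itertools import groupby


def char_type(ch):
    """Classify one character by its Unicode name (an assumed heuristic)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanzi"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if "LATIN" in name:
        return "alphabet"
    return "other"


def runs(text):
    """Split text into maximal strings that consist of one character type."""
    return [(kind, "".join(group)) for kind, group in groupby(text, key=char_type)]


# Hypothetical sample string: siken-ki wo tesuto suru.
sample_runs = runs("試験機をテストする")
```

For the sample string, runs returns the kanzi run 試験機, the hiragana run を, the katakana run テスト, and the hiragana run する.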
A verb or an adjective is often found across two runs.</Paragraph> <Paragraph position="2"> Such a word normally has its root part written in kanzi and its inflection part in hiragana.</Paragraph> <Paragraph position="3"> As the boundaries of runs roughly correspond to the boundaries of words, the different graphic appearance of kanzi and kana letters helps a reader to parse a sentence. Hence, long runs, when they occur, hide the word boundaries and make a sentence less readable.</Paragraph> <Paragraph position="4"> Long kanzi runs pose another problem for readability.</Paragraph> <Paragraph position="5"> Kango can be formed into a compound word simply by concatenating two or more of them successively. The meaning of the new word is formed from the meanings of its elements. However, how the elements are related to each other in the compound word is not clear from mere concatenation. A reader must infer the relation pragmatically. Therefore, it is often the case that the meaning of a compound kango is ambiguous. For example, siken-ki can be read as siken-suru-kikai (testing machine) or as siken-sareru-kikai (machine to be tested); rinzi-kyouiku-singi-kai, meaning rinzi-ni-kyouiku-ni-tuite-singi-suru-kai (an ad hoc council to deliberate on education), can also be read as rinzi-no-kyouiku-ni-tuite-singi-suru-kai (a council to deliberate on ad hoc education).</Paragraph> <Paragraph position="6"> It is unlikely that any good theory is possible about the relationship between run frequencies and readability.</Paragraph> <Paragraph position="7"> Nevertheless, run frequencies may be used in a manner similar to character frequencies. 
In a study preceding this one /Tateisi 1987/, we found that the run frequencies are correlated with the frequencies of the characters of the corresponding types (0.6 < r < 0.9, depending on character types) and that the run alone, as a unit, is sufficient to obtain the information otherwise supplied by both characters and runs.</Paragraph> </Section> <Section position="3" start_page="650" end_page="650" type="sub_section"> <SectionTitle> Sentence Length </SectionTitle> <Paragraph position="0"> The length of sentences is a known factor of readability, as /Morioka 1958/ and other surveys show. In Japanese, as in other languages, long sentences tend to have complicated structures.</Paragraph> <Paragraph position="1"> Sentence length can be measured by the number of characters a sentence contains. Though Sakamoto's survey of children's textbooks /Sakamoto 1963/ shows that the number of words per sentence is a more accurate indicator of the grade level than the number of characters, it also shows that the two are in good proportion, the correlation coefficient being 1.00.</Paragraph> <Paragraph position="2"> Punctuation. Tootens, like commas, are put at the end of a phrase. The number of tootens per sentence corresponds to the number of phrases per sentence. /Hayasi 1959/ found that junior high school students and senior high school students understood a text more precisely if modifying phrases were separated and made into independent sentences. It follows from this result that a sentence with a smaller number of phrases is easier to understand. /Kozuru 1987/ found that the average number of tootens in a sentence increases with students' grade level. These findings indicate that the number of tootens in a sentence is greater in more difficult-to-read texts. Thus, the number of tootens is a factor of readability.</Paragraph> </Section> </Section> <Section position="4" start_page="650" end_page="650" type="metho"> <SectionTitle> 3. 
The Method of Analysis </SectionTitle> <Paragraph position="0"> We shall first extract several numerical characteristics of style from texts and then derive a readability formula as a linear combination of the values of those characteristics. A numerical index is only a rough scale of readability. It should be calculated with simple devices and methods. We use the character as the unit for measuring length for the sake of simple calculation.</Paragraph> <Paragraph position="1"> Several surface characteristics are extracted from the materials. Differences in the characteristics among materials arise from several factors. They may be factored into variation in the topic area of the texts and variation in style. Style may differ by the writer or by the intention of the text. Introductory textbooks should be easier to read than technical papers intended for experts, and their authors will be careful not to make them difficult to read.</Paragraph> <Paragraph position="2"> Thus they will be written in a style easier to read than the style of technical papers. Translations tend to have a particular style, highly dependent on the syntax of the original language. The particular style of translations is often found awkward as Japanese and less readable. The distinctive features of texts with different intentions can be used as a criterion for assessing readability.</Paragraph> <Paragraph position="3"> To find the distinctive features of texts from the surface characteristics, we use principal component analysis (PCA), which extracts factors of variance of the characteristics. We will then examine the components by comparing component scores for the materials with empirical knowledge of readability. In this way we shall choose a component relevant to stylistic readability. A principal component is a linear combination of the variables. 
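As a hedged illustration of such a linear combination, the Python sketch below scores texts as a weighted sum of standardized variables. The data, the weights, and the names standardize and component_scores are hypothetical; the weights are not the loadings of table 4-1, and standardizing each variable before weighting is an assumption.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one variable to zero mean and unit (population) variance."""
    m = mean(column)
    s = pstdev(column)
    return [(x - m) / s for x in column]


def component_scores(rows, weights):
    """Score each text as a weighted sum of its standardized variables."""
    columns = [standardize(col) for col in zip(*rows)]
    return [sum(w * z for w, z in zip(weights, zs)) for zs in zip(*columns)]


# Hypothetical data: (letters per sentence, tooten to kuten ratio) for three texts.
rows = [(30.0, 1.0), (45.0, 2.0), (60.0, 3.0)]
# Hypothetical weights chosen for illustration only.
weights = [-0.5, -0.5]
scores = component_scores(rows, weights)
```

With these made-up weights, the text with the shortest sentences and fewest tootens receives the highest score and the text with the longest sentences receives the lowest.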
The formula which computes the component can be used as a readability formula.</Paragraph> <Section position="1" start_page="650" end_page="650" type="sub_section"> <SectionTitle> Variables </SectionTitle> <Paragraph position="0"> We have chosen ten variables that represent the four factors of readability: (1) for each type of character (Roman alphabet, kanzi, hiragana, katakana), the relative frequency of runs (maximal strings) that consist only of that type of character, (2) the average number of letters per run for each type of run, (3) the average number of letters per sentence, and (4) the tooten to kuten ratio.</Paragraph> <Paragraph position="1"> Sentence length is measured by the number of characters between two adjacent sentence-ending marks (kuten, exclamation marks, and question marks). Kuten, unlike the period, is placed only at the end of a sentence, not as an indicator of abbreviations. Therefore, the end of a sentence is almost always detected by detecting kuten, although the end of a quotation embedded in a sentence is also counted as the end of a sentence.</Paragraph> <Paragraph position="2"> Samples. We must compare readability among texts written in the common way, that is, texts as written by their authors. For example, the textbooks for elementary school children are inadequate. This is because those textbooks are written in an unusual way. They use hiragana where most adults use kanzi, transcribing the kanzi the readers are not expected to have learned yet. We will therefore take documents written by adults for adults as materials of the analysis.</Paragraph> <Paragraph position="3"> Seventy-seven (77) documents were selected as sample texts to extract the data from. Seventy of the samples are machine-readable documents that were stored in our laboratory. They are technical papers, textbooks for college students, and translations of computer science materials, written by 13 authors. 
Seven of the samples are included as indicators for reading ease. Five of these indicators are texts judged as easy. Three of them are taken from books on technical writing; two are taken from essays for general readers. They are considered to be easier than the papers or textbooks for scientists. The remaining two are texts judged as difficult. One of them is a decision on a case of infringement of copyright of a computer program; the other is a juridical paper about copyright and new media such as magnetic tapes. Juridical texts are empirically known to be hard to read.</Paragraph> <Paragraph position="4"> Tables, figures, references, and expressions which are displayed independently from the passage are deleted from the samples.</Paragraph> <Paragraph position="5"> 4. Result of the Principal Component Analysis (PCA). The principal component analysis was done by S routines /Becker 1984/ on a Vax 8600 at the Computer Center of the University of Tokyo. The components and the loadings of each variable are shown in table 4-1.</Paragraph> <Paragraph position="6"> The first three components (eigenvalue > 1) are examined.</Paragraph> <Paragraph position="7"> Total variance explained by these components is 70%. Figure 4-1 shows the scatter plot of sample texts. 
The letter i designates introductory textbooks, m magazine articles other than technical papers, and p technical papers; t and T designate translations from English papers; and D and E designate the difficult and easy indicators, respectively.</Paragraph> <Paragraph position="8"> The following are observed for the first component.</Paragraph> <Paragraph position="9"> (1-1) This component reflects the occurrences of alphabets; it separates the texts with little alphabetic content from the texts abundant in alphabetic content.</Paragraph> <Paragraph position="10"> (1-2) The texts with many equations and abbreviations have high scores on this component.</Paragraph> <Paragraph position="11"> The score on this component shows the area of topic.</Paragraph> <Paragraph position="12"> The following are observed for the second component.</Paragraph> <Paragraph position="13"> (2-1) This component separates the texts with long sentences and long kanzi runs from the other texts.</Paragraph> <Paragraph position="14"> (2-2) The component score agrees with human judgement about easy/difficult texts. It is high on the texts judged easy and low on the texts judged difficult. The second component score shows the distinction more clearly than the first or the third.</Paragraph> <Paragraph position="15"> (2-3) Introductory textbooks have generally higher scores than papers. Again, the second component score shows the distinction more clearly than the first or the third.</Paragraph> <Paragraph position="16"> Since long sentences and long kanzi runs make texts less readable as stated before, (2-1) indicates that the second component can be an indicator of readability. (2-2) and (2-3) also indicate that the second component is related to readability. The third component shows a difference of proportions of katakana and kanzi. From table 4-1 we can find that the variables on kanzi have positive loadings and the variables on hiragana and katakana have negative loadings on the component. 
Thus, the component reflects the proportion of kanzi: its score increases for texts with more kanzi.</Paragraph> </Section> </Section> <Section position="5" start_page="650" end_page="651" type="metho"> <SectionTitle> 5. Principal Component Scores and Style </SectionTitle> <Paragraph position="0"> We have observed the following phenomena on the second component.</Paragraph> <Section position="1" start_page="650" end_page="651" type="sub_section"> <SectionTitle> Improvement and Principal Component Scores </SectionTitle> <Paragraph position="0"> Five of the sample texts are chapters (indicated by T in figure 4-1) of the final versions of the translation of an English paper by different translators. Their component scores were compared with those of the respective draft versions. (The drafts are not among the samples.) The first three component scores of the final manuscripts were uniformly higher than those of the drafts, i.e., the scores became higher with the improvement of their style. The differences between the final versions and the respective draft versions are shown in table 5-1. The mean difference of the second component is found to be greater than that of the first at the 5 percent significance level (p = 0.044) and greater than that of the third at the 10 percent significance level but not at the 5 percent level (p = 0.098). Thus, the difference of the second component is greater than the other two. This agrees with the observations on the distribution of texts, that is, easier-to-read texts have higher second component scores than difficult ones, since a text generally becomes easier to read after improvement.</Paragraph> <Paragraph position="1"> Frequencies of Passive Forms. Table 5-2 below shows the correlation between the component scores and the frequencies of passives. Passive forms are counted using the pattern matching method proposed by /Ushijima 1987/. 
The count is divided by the number of kutens in a sample, yielding the number of passives per sentence, or per sentence ending. Japanese passive forms are also used for potentials. For example, mirareru may mean either be seen (passive) or can see (potential), and taberareru may have one of three meanings: be eaten, can eat, and can be eaten. Thus, frequent use of passives tends to make a document vague and less readable.</Paragraph> <Paragraph position="2"> The second component scores have a higher correlation than the other component scores. Note that the correlation coefficient is negative. This agrees with the observation that the second component score is lower on difficult-to-read texts and that the frequency of passives is higher on such texts.</Paragraph> <Paragraph position="3"> Figure 5-1 shows the plot of the second component scores and the frequencies of passives per 1000 sentences. The line in the figure is the regression line.</Paragraph> </Section> </Section> </Paper>