<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1049"> <Title>TEXT PROCESSING OF THAI LANGUAGE =THE THREE SEALS LAW=</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> TEXT PROCESSING OF THAI LANGUAGE =THE THREE SEALS LAW= </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="332" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Computer software for processing the Thai language has been developed at the National Museum of Ethnology, Osaka, Japan. We use a popular intelligent terminal, the TEKTRONIX 4051, for inputting and editing, an IBM 370 model 138 for KWIC making and sorting, and CANON's laser beam printer for final output.</Paragraph> <Paragraph position="1"> Using these systems, &quot;Kotmai Tra Sam Duang&quot; (the Three Seals Law), which contains many kinds of laws and ordinances proclaimed in Thai between 1350 and 1805 A.D., has been computerized. This text has 1700 pages and about 1400000 letters. The KWIC index runs to 200000 lines.</Paragraph> <Paragraph position="2"> Some statistical data for this text have been obtained. They are occurrence frequency data for single letters, group vowels, letter combinations (digrams), etc.</Paragraph> <Paragraph position="3"> Acknowledgements This report is the result of a joint project at the National Museum of Ethnology. The members are Y.Ishii, I.Akagi, S.Tanabe, Y.Sakamoto, S.Uemura, A.Ishizawa, M.Sawamura, K.Sasaki, Y.Kurita, and S.Sugita. Their research fields are ethnology, linguistics, computer science, sociology, etc.</Paragraph> <Paragraph position="4"> We thank Mr. Sophon Chitthasatcha, Miss Sumalee Maungpaisaln and Miss Hiroe Matsumoto for their help in segmentation, inputting and correction.</Paragraph> <Paragraph position="5"> We also thank Prof. 
K.Nakayama and A.Oikawa of Tsukuba University for their support in making Thai letter patterns and output software for the laser beam printer.</Paragraph> <Paragraph position="6"> Introduction In the field of ethnology or cultural anthropology, ethnographies are very important information sources for the comparative study of many different societies. Not only bibliographic data but also the contents of the texts are necessary. HRAF (Human Relations Area Files), which was developed by Dr. Murdock and is now managed by HRAF Inc. at Yale University, is a unique retrieval system. It uses about 800 category codes by which analysts classify the contents of each page of a book.</Paragraph> <Paragraph position="7"> Though the HRAF system is an elaborate work, it is not easy to search for the necessary data by user terms, that is, natural words. If the whole text is fed into a computer, it is very easy to retrieve any part of the text by the same natural words used in the text.</Paragraph> <Paragraph position="8"> An on-line retrieval system is smart and effective. But sometimes a researcher wants a printed index such as KWIC, which is usable at any time and place. Combining a KWIC index with a thesaurus dictionary gives us a very powerful tool for searching for special expressions hidden in the text.</Paragraph> <Paragraph position="9"> Until quite recently, at least in Japan, most computer processing of natural language was skewed toward Indo-European languages or Japanese. In ethnological studies, we must treat many areas of the world. We need computer software which processes unfamiliar languages for us, such as Arabic, Korean, Sumerian, Mongolian, Devanagari, Thai, etc. 
National Museum of Ethnology at Osaka has introduced several computer systems to encourage study in the humanities, and is now developing many application programs which can be used by any researcher who does not know computer programming or how to use a computer.</Paragraph> <Paragraph position="10"> This report describes one such application program, which treats Thai letters. The points of our work are as follows: 1) A popular computer terminal is used for Thai letter inputting and editing. It is easy to use because dead key operation is not necessary.</Paragraph> <Paragraph position="11"> 2) KWIC making and sorting software is implemented in FORTRAN, so it can be transferred to any other computer system. The algorithm is not so complex, but it had not been implemented before only because such languages are not popular.</Paragraph> <Paragraph position="12"> 3) Statistical data of the text are obtained. They are the occurrence frequencies of single letters, group vowels, and letter combinations. These data will help us as contextual data for OCR.</Paragraph> <Paragraph position="13"> Segmentation There is no segmentation problem in the case of Indo-European languages, because they have clear separators for word units, such as the space or comma. There are, however, many languages in Asia which have no clear separator, such as Korean (Hangul), Chinese, Japanese, Thai, etc. The examples shown below mean that several different segmentations exist.</Paragraph> <Paragraph position="14"> Segmentation affects the meaning of a sentence and retrieval efficiency.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Fig.1 Examples of different segmentations </SectionTitle> <Paragraph position="0"> Cutting into long units saves effort, but it is difficult to search for a string included in such a unit. Cutting into short units is effective for searching, but too many keywords appear. 
The text, the Three Seals Law, has no word separator, as shown in Fig.3.</Paragraph> <Paragraph position="1"> So it is necessary to segment it into appropriate units before making the KWIC index. But this is a difficult problem, because segmentation needs a good understanding of the meaning, which conversely needs the KWIC index.</Paragraph> <Paragraph position="2"> We adopted a practical method which at first cuts the text into long units and then cuts again after looking at the KWIC index.</Paragraph> <Paragraph position="3"> [Figure: RE-SEGMENTATION]</Paragraph> </Section> <Section position="2" start_page="0" end_page="332" type="sub_section"> <SectionTitle> Inputting Terminal </SectionTitle> <Paragraph position="0"> We use a popular intelligent graphic terminal, the TEKTRONIX 4051, which has the usual alphabet keyboard. We stuck Thai letter labels on the side of each key so that it looks like a Thai typewriter. A code table of Thai letters and the corresponding English alphabet letters is shown in Table 1.</Paragraph> <Paragraph position="1"> The characteristics of this terminal are: 1) It generates Thai letter patterns by a BASIC program in graphic mode. The user can confirm the letter he typed.</Paragraph> <Paragraph position="2"> 2) It has local cassette memory, so that the user can input and edit data at any time, even when the host computer is not working. 3) By way of a communication line, stored data can be transmitted to the host computer for time-consuming work.</Paragraph> <Paragraph position="3"> 4) It is easy to implement a flexible Thai language editor, which accepts alphabet commands and displays Thai letters. 5) A copy of the screen can be taken by the hard copy unit attached to it.</Paragraph> <Paragraph position="4"> Rules for text inputting The text has many irregular expressions, so the following expedients are adopted. 
1) Words or phrases quoted from the Pali language are skipped, and a special symbol is inserted to indicate that words have been skipped.</Paragraph> <Paragraph position="5"> 2) Tables are skipped.</Paragraph> <Paragraph position="6"> 3) Special expressions for money, dates, and fractional numbers are transformed into sentence form.</Paragraph> <Paragraph position="7"> 4) Vertical expressions, as shown in Fig.3, have a special symbol attached before and after the word.</Paragraph> <Paragraph position="8"> 5) Parallel expressions in the middle of a line and tree-like expressions are transformed into a linear form from which the original form can be reconstructed as far as possible.</Paragraph> <Paragraph position="9"> Order of input The order of input of Thai letters to the computer is the same as the order in which one would strike the keys of a typewriter. Correction Thai editor A line editor for Thai text is implemented on the TEKTRONIX 4051 terminal. Commands are English-like terms, and Thai text is displayed in Thai letters. This editor supposes that there are a volume number, a page number, and a line number.</Paragraph> </Section> </Section> <Section position="3" start_page="332" end_page="332" type="metho"> <SectionTitle> ENTER THE VOLUME NUMBER = v </SectionTitle> <Paragraph position="0"> ;specify volume number
*PAGE,N ;specify page number. Until the next page command, this page is held in memory.</Paragraph> <Paragraph position="1">
*LADD XX ;XX is added to this page as the last line
*LINS,M XX ;XX is inserted as a new line after line number m
*LDEL,M ;line number m is deleted from this page
*SHOW,M ;the string of line number m is displayed in Thai letters
*LGET,M ;line number m becomes the object to be edited by the following subcommands
*ADB XX ;XX is added to the beginning of line m
*ADE XX ;XX is added to the end of line m
*DEL XX ;string XX is deleted from line m. If there are several XX's in line m, the position numbers are displayed. 
Enter the corresponding number after the prompt &quot;which?&quot;
*INS XX BEFORE YY ;string XX is inserted before string YY. If there are several YY's, type the corresponding number after the prompt &quot;which?&quot;
*REP XX BY YY ;string XX is replaced by YY.</Paragraph> </Section> <Section position="4" start_page="332" end_page="1186" type="metho"> <SectionTitle> KWIC making </SectionTitle> <Paragraph position="0"> The most obvious complication is the fact that in Thai writing as many as three separate characters can appear at the same horizontal position, in four different vertical positions. Therefore the number of letters to take as before or after context must be carefully counted.</Paragraph> <Paragraph position="1"> As an index for every unit, the volume number, page number and line number are attached on the left side.</Paragraph> <Paragraph position="2"> Sorting The sorting algorithm for Thai words is not as simple as for English.</Paragraph> <Paragraph position="3"> Thai Computer algorithm 1) Every occurrence of a pre-positioned vowel is moved to the position immediately following the consonant it precedes.</Paragraph> <Paragraph position="4"> 2) Diacritic symbols are moved to the end of the word, with an indication of their position counted from the end of the word. 3) Each letter is replaced by the code given in Table 1.</Paragraph> <Paragraph position="5"> 4) Then two words are compared as if they were numerals.</Paragraph> <Paragraph position="7"> We ignored step 2) of the algorithm, because our segmentation units are not necessarily words, so it does not work effectively.</Paragraph> <Section position="1" start_page="334" end_page="1186" type="sub_section"> <SectionTitle> Statistical data </SectionTitle> <Paragraph position="0"> The total number of letters in the machine-readable text is 1362602, which includes special symbols such as the separator, skip symbol, comma, etc. The total number of lines is 29582. Table 2 shows the occurrence frequency of each letter. 
Table 3 shows the occurrence frequency of compound vowels. Combination frequencies of two letters are listed from the highest frequency. The combination is taken as shown below.</Paragraph> <Paragraph position="1"> Fig.5 shows the distribution of the ratio of upper and lower letters to the total number of letters in a line. The average ratio is 19%. A simple calculation gives a ratio of 23% for the number of upper and lower letters relative to the number of horizontal positions. This means that in a line of Thai letters, the number of upper and lower letters is about 23% of the number of normal horizontal positions.</Paragraph> <Paragraph position="2"> T=total number of letters in one line S=total number of upper and lower letters in the line M=T-S=number of horizontal positions in the line. With S/T = 0.19, S/M = (S/T)/(1 - S/T) = 0.19/0.81, which is about 0.23.</Paragraph> <Paragraph position="4"> printer which can print out any kind of figures and characters. In character mode, a character must be defined as a dot matrix of 8X8, 16X16, 24X24, 32X32, etc.</Paragraph> <Paragraph position="5"> We use a 16X16 matrix as the minimum module of a Thai letter pattern. Thai characters are classified into fifteen types according to the size of their dot matrix. The largest pattern has a 48X32 matrix, which uses 6 modules.</Paragraph> <Paragraph position="6"> One text line is printed as five horizontal zones. Each zone is 16 dots high. The horizontal width of each letter can be changed character by character. But within the same zone, the vertical size cannot be changed.</Paragraph> <Paragraph position="7"> Control of different letter width The complex part of the output program is controlling the width so that the heading part of the KWIC index is aligned vertically. An example of a KWIC index is shown in Fig.6. We have printed about 200000 lines.</Paragraph> </Section> </Section> </Paper>