File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/p00-1075_metho.xml
Size: 6,349 bytes
Last Modified: 2025-10-06 14:07:22
<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1075"> <Title>Development of Computational Linguistics Research: a Challenge for Indonesia</Title> <Section position="2" start_page="1" end_page="3" type="metho"> <SectionTitle> 2 Past Research Activities </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.1 Corpus Analysis </SectionTitle> <Paragraph position="0"> Corpus analysis is an important means of understanding how the use of a language evolves among its speakers. In the case of Bahasa Indonesia, there has been almost no research on corpus analysis.</Paragraph> <Paragraph position="1"> One exception is the work of R. R. Hardjadibrata (1969) from Monash University, who conducted a word frequency analysis of Indonesian newspapers. Similar work was conducted by the MMTS project (described in Section 2.3); however, the results of that group's corpus analysis were not made public.</Paragraph> <Paragraph position="2"> Given this situation, together with a group of colleagues from both the Faculty of Computer Science and the Faculty of Letters, I conducted an Indonesian corpus analysis using newspapers as the text source. We collected 52 editions of Kompas, a national newspaper with a large readership, published in 1994. Each of the 52 editions corresponds to one week of the year and was chosen at random from the 7 daily editions of that week. From this collection, we constructed a corpus of 2,200,818 words comprising 74,559 unique words. Of these, 1,826,740 words (27,738 unique words) matched KBBI entries, while the rest are either names or foreign words. 
A detailed analysis can be found in Muhadjir (1996).</Paragraph> <Paragraph position="3"> KBBI (Kamus Besar Bahasa Indonesia), the standard dictionary of Bahasa Indonesia, contains a little more than 70,000 word entries.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.2 Morphological Analysis </SectionTitle> <Paragraph position="0"> Everyone who has used a word processor understands the importance of a spelling checker in producing an error-free document. To develop a spelling checker, we need to understand the morphological structure of words, especially how derived words are constructed from root words by adding affixes.</Paragraph> <Paragraph position="1"> We have analyzed the morphological structure of Indonesian words and, based on this analysis, developed a stemming algorithm suited to those words.</Paragraph> <Paragraph position="2"> Unlike English, where suffixes dominate the formation of derived words, Bahasa Indonesia relies on both prefixes and suffixes to derive new words. Therefore, to stem a derived Indonesian word and obtain its root word, we have to check for both prefixes and suffixes in that word (Nazief, 1996). 
In addition, as in English, a derived word can carry more than one suffix.</Paragraph> <Paragraph position="3"> Based on this stemming algorithm, we have developed spelling checker and spelling-error corrector utilities as part of the Lotus Smartsuite package. Lotus Smartsuite is an office automation package consisting of word processor, spreadsheet, presentation editor, and database applications, developed by Lotus Development Corporation.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.3 The MMTS Project </SectionTitle> <Paragraph position="0"> One notable effort among the few computational linguistics research activities in Indonesia is the Multilingual Machine Translation System (MMTS) project, conducted by the Agency for Assessment and Application of Technology (BPPT) as part of a multinational research project involving China, Indonesia, Malaysia, and Thailand, and led by Japan (see http://www.cicc.or.jp/homepage/english/about/act/mt/mt.htm, http://www.aia.bppt.go.id/mmts).</Paragraph> <Paragraph position="1"> Unfortunately, there are very few publications about this work that could have benefited the computational linguistics community in the country. One of the few results that the MMTS project has made publicly available is the Indonesian Word Electronic Dictionary (KEBI), which can be accessed online at http://nlp.aia.bppt.go.id/. The dictionary contains 22,500 root-word and 43,500 derived-word entries.</Paragraph> </Section> </Section> <Section position="3" start_page="3" end_page="3" type="metho"> <SectionTitle> 3 Understanding Indonesian Grammar </SectionTitle> <Paragraph position="0"> Currently, I am concentrating on developing a syntax analyzer for sentences written in Bahasa Indonesia. The initial approach was to use a context-free grammar with restrictions, such as that used in linguistic string analysis (Sager, 1981). Using this approach, we have developed a grammar that handles declarative sentences (Shavitri, 1999). However, our experience shows that we need more detailed word categories than are currently available in the standard Indonesian dictionary (KBBI) before the grammar can be used effectively.</Paragraph> <Paragraph position="1"> This finding shows the importance of collaborating with linguists, who understand this field better. But before we can do this, we need to convince our fellow linguists of the importance of computers in their field.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Acknowledgements </SectionTitle> <Paragraph position="0"> I would like to thank Mirna, bu Multamia, pak Muhadjir, bu Kiswartini, and all of my students who have collaborated with me in these efforts to understand Bahasa Indonesia better.</Paragraph> </Section> </Paper>