<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2185">
  <Title>Engineering Platform): An Open Architecture for Language Engineering. CEC and Cray Sys-</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Korean Language Engineering
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Language Engineering
</SectionTitle>
      <Paragraph position="0"> Language engineering is an activity that implements various functions related to a language and builds up an information base. It realizes the linguistic activities of everyday life and the linguistic competence of human beings with the aid of computer science, thereby supporting people's intellectual linguistic production. Language engineering not only collects and disseminates the information and knowledge of a language among the linguistic community but also serves as a foundation on which linguistic culture and technologies can be based (Oh et al., 1994).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Korean Language Engineering
</SectionTitle>
      <Paragraph position="0"> Korean language engineering is language engineering for the Korean language. It came into being in the early 1980s with the emergence of personal computers (PCs). *This work is funded by the Ministry of Science and Technology and the Ministry of Education and Athletics, as part of a contract by the Center for Korean Language Engineering.</Paragraph>
      <Paragraph position="1"> In the beginning, work focused on the Korean alphabet and some scrappy parts of character processing, lacking a global view of engineering approaches. Technical approaches to Korean began with the formation of the special interest group on Korean information processing under the Korea Information Science Society. In 1994 the Center for Korean Language Engineering (KLE) was founded to serve as a central organization for Korean language engineering, which aims to plan and program related projects and work in a consistent, systematic way with long-term goals. It also brings academic and research institutes and industries together around common goals: an efficient and harmonious drive toward research and development, and the establishment of long-range policies and strategies for Korean language engineering.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="1049" type="metho">
    <SectionTitle>
2 Areas of Korean Language
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="1049" type="sub_section">
      <SectionTitle>
Engineering Researches
</SectionTitle>
      <Paragraph position="0"> According to the level of technology, KLE partitioned its projects into three classes.</Paragraph>
      <Paragraph position="1"> Fundamental technology deals with foundational and theoretical research, the collection and manipulation of data, and standardization. From the linguistic viewpoint, these include language formalisms, text corpora, and statistical information about a language. On the information engineering side, the technology covers information interchange and compression techniques, basic techniques of artificial intelligence such as knowledge representation and searching, and tools for manipulating the Korean alphabet. From the cognitive engineering point of view, the research focuses on the structure of the Korean alphabet, fonts, command structures, and interdisciplinary work in cognitive science. Another division handles standardization issues for code schemes and vocabularies, keyboard layout, standard text formats, and internationalization.</Paragraph>
      <Paragraph position="2"> The second class is called basic technology, which is related to the basic software libraries for Korean language processing. Included in this class are natural language analysis, pattern recognition, multimedia data base, and data conversion tools.</Paragraph>
      <Paragraph position="3"> The third class is applications technology. It consists of systems for text interchange and compression, hypertext, multimedia, word processing and others. For knowledge processing, it will cover document paraphrasing, indexing and retrieval, computer-based instruction/education, etc.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1049" end_page="1049" type="metho">
    <SectionTitle>
3 Information Platform
</SectionTitle>
    <Paragraph position="0"> For Korean language engineering, it is necessary to develop all the projects of each area systematically and integrate them into a uniform frame, called an information platform (IP). 1 KLE programs each project according to its priority and the state of the art of the technology. Consequently, IP reflects the status of ongoing projects and is an as-is framework on which further research and development work can be performed.</Paragraph>
    <Paragraph position="1"> Figure 1 shows the conceptual diagram of IP.</Paragraph>
    <Paragraph position="2"> This platform doesn't integrate all the project outcomes but some of the fundamental resources and basic tools, since it reflects the current configuration, which is not fixed but open to changes. Full integration of the project outcomes will be available at the end of the first phase in 1997.</Paragraph>
    <Paragraph position="3"> This platform is different from ALEP (Advanced Language Engineering Platform) (Simpkins, 1994) in that ALEP is an environment that can be provided to users as a form of a (customizable) package whereas our platform is a server-client model in pursuit of a web-based service for resources and tools.</Paragraph>
    <Paragraph position="4"> The World Wide Web is composed of hyperdocuments and hyperlinks to handle multimedia data as well as to provide easy and timely access to electronic information. It uses the hypertext markup language (HTML), which is based on the Standard Generalized Markup Language (SGML). Therefore, it guarantees standardization and straightforward design characteristics, which lead to ease of system design and flexibility of system configurations (Berners-Lee &amp; Connolly, 1993). Its other characteristic lies in the common gateway interface (CGI), which makes it possible to interface with various shell scripts and program codes without difficulty. Yet another point is that the server-client model makes the platform transparent to the users.</Paragraph>
    <Paragraph position="6"> SunOS, version 1 platform and web pages are only in Korean. The 2nd version will be released on Solaris at the address "http://kibs.kaist.ac.kr/KLE/KIBS/".</Paragraph>
    <Paragraph position="7"> IP consists of three parts. First, text corpora, voice and handwritten scripts DBs, dictionaries and a set of terminological DBs constitute the information base. The information base may directly be distributed through ftp server or indirectly accessed by the language tools on the higher layer of the http server configuration.</Paragraph>
    <Paragraph position="8"> Secondly, language tools run on the http server with the aid of CGI and are also distributed to users by ftp as executable code. Since we aim to provide software versions on Unix, Solaris, and PC Windows altogether, initial hardware requirements for each tool may be different. Finally, documentation will be prepared as the projects progress.</Paragraph>
  </Section>
  <Section position="5" start_page="1049" end_page="1049" type="metho">
    <SectionTitle>
4 Information Base
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1049" end_page="1049" type="sub_section">
      <SectionTitle>
4.1 Text Corpus
</SectionTitle>
      <Paragraph position="0"> Text corpora are essential to statistical modeling, in developing formal theories of the grammars, investigating prosodic phenomena in speech, and evaluating or comparing the adequacy of parsing models (Marcus et al., 1993). There are four sorts of corpora from contemporary Korean texts.</Paragraph>
      <Paragraph position="1"> * Raw corpus Two factors are considered: the genre of each source text, which is related to the objective(s) in using the corpus, and the category of the text, which represents the internal structure of the text.</Paragraph>
      <Paragraph position="2"> Major sources of the corpus include books, magazines, and newspapers; to date, three million word phrases have been gathered.</Paragraph>
      <Paragraph position="3"> * Part-of-speech (POS) tagged corpus The POS tagset for Korean originated from (Kim &amp; Seo, 1994). In the version 1 platform we produced 2.5 million automatically tagged word phrases and 1.5 million post-edited word phrases.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1049" end_page="1050" type="metho">
    <SectionTitle>
* Tree-tagged corpus
</SectionTitle>
    <Paragraph position="0"> This can be produced by applying syntactic tagset to the POS tagged corpus. The syntactic tagset is being studied using 100,000 sentences out of POS tagged corpus, and the resultant tree-tagged corpus using a tree tagger will appear at the end of this year.</Paragraph>
  </Section>
  <Section position="7" start_page="1050" end_page="1050" type="metho">
    <SectionTitle>
* Categorized corpus
</SectionTitle>
    <Paragraph position="0"> Korean verbs and adjectives are classified into over seventy categories, and a set of sentence styles is investigated for 940 basic verbs of those categories. About thirty-five thousand sentences are available in the version 1 platform.</Paragraph>
    <Section position="1" start_page="1050" end_page="1050" type="sub_section">
      <SectionTitle>
4.2 Voice Data Base
</SectionTitle>
      <Paragraph position="0"> This resource can be used for speech recognition and synthesis applications. We initially focused on word-level voice data. It includes phonetically balanced words, phonemic sequences pronounced by four different speakers, and narration of sample stories. It also stores the sounds of single syllables, diphones, numerics, high-frequency words, gazetteers, functional words, and consecutive word sequences. The data are stored on server disks and CD-ROMs as waveforms. This effort will be extended to sentence-level collections such as phonetically balanced sentences, speech dialogues, and scenarios.</Paragraph>
    </Section>
    <Section position="2" start_page="1050" end_page="1050" type="sub_section">
      <SectionTitle>
4.3 Handwritten Scripts Data Base
</SectionTitle>
      <Paragraph position="0"> Since character recognition systems are under the control of applications engineers, the objective of this work is to provide well-formed data and evaluation criteria for those recognition systems. We divide our data collection into three phases: to scan, at 300 dpi resolution, one thousand sets of 590 high-frequency syllables in the first year, then of 990 syllables and 2,350 syllables in the following years. 3 At each phase, we develop both square-hand characters and free-style characters.</Paragraph>
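      <Paragraph position="1"> The syllable counts quoted for this data base can be checked with a little arithmetic. The following is a minimal sketch, assuming the composition scheme used for precomposed Hangul syllables in Unicode (19 initial consonants, 21 vowels, and 27 final consonants plus an empty final); the function names are hypothetical and are not part of the platform.

```python
# Assumption: modern Hangul syllables are composed from 19 initial consonants,
# 21 medial vowels, and 28 final slots (27 final consonants plus "no final").
INITIALS = 19
MEDIALS = 21
FINALS = 28

HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable in Unicode


def syllable_count() -> int:
    """Number of composable modern Hangul syllables."""
    return INITIALS * MEDIALS * FINALS


def compose(initial: int, medial: int, final: int = 0) -> str:
    """Compose a precomposed syllable from zero-based jamo indices."""
    if not (
        initial in range(INITIALS)
        and medial in range(MEDIALS)
        and final in range(FINALS)
    ):
        raise ValueError("jamo index out of range")
    return chr(HANGUL_BASE + (initial * MEDIALS + medial) * FINALS + final)


def coverage(collected: int) -> float:
    """Share of all composable syllables covered by one collection phase."""
    return collected / syllable_count()


if __name__ == "__main__":
    print(syllable_count())  # 19 * 21 * 28
    for phase in (590, 990, 2350):  # phase sizes quoted in the text
        print(phase, round(coverage(phase), 3))
```

      Under these assumptions the product 19 x 21 x 28 reproduces the 11,172 composable syllables, so the final phase of 2,350 syllables covers roughly a fifth of them.</Paragraph>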
    </Section>
    <Section position="3" start_page="1050" end_page="1050" type="sub_section">
      <SectionTitle>
4.4 Dictionaries and Terminological Data
Base
</SectionTitle>
      <Paragraph position="0"> * Multilingual technical dictionary The objective is to set up mappings between technical terms of Korean and other language(s) in both directions. The first work is done for the computer science domain, and it has 35,000 entries each for Korean and English. It will be extended to cover Chinese, Japanese, and German as well as more domains including electrical/electronic engineering, medical science, law, etc.</Paragraph>
      <Paragraph position="1"> * Monolingual terminology data bank Users need definitions and explanations of technical terms during their work on specific domains. This work provides users with such terminological details. We compiled 15,000 entries each for culture/art and Korean classical literature.</Paragraph>
      <Paragraph position="2"> * Ontology-based lexicon Currently available dictionaries are semantically oriented. They don't provide pools of target language expressions but offer basic meanings for entries together with some syntactic and morphological information. An ontology-based lexicon is lexically oriented in that it guides the user to find a pragmatically or contextually equivalent expression corresponding to the source language expression. The work is in the feasibility study phase, with intensive focus on collecting Korean-English bilingual information sources and developing tools for lexicon construction.</Paragraph>
      <Paragraph position="3"> 3 It is possible to compose up to 11,172 syllables out of the Korean alphabet, but Korean Standard Code KSC-5601 prescribes 2,350 complete codes for Korean syllables.</Paragraph>
      <Paragraph position="4"> * Lexicon for morphological analysis The lexicon for Korean morphological analysis is currently being built to have 30,000 entries with off-line management tools, and will grow to 100,000 entries with on-line tools after two more years. 4</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="1050" end_page="1051" type="metho">
    <SectionTitle>
5 Language Engineering Tools
</SectionTitle>
    <Paragraph position="0"> Basically, the tools that we present here are for text corpora and dictionaries, except for voice and character recognizers. The latter two programs are currently under development and will be integrated later.</Paragraph>
    <Section position="1" start_page="1050" end_page="1050" type="sub_section">
      <SectionTitle>
5.1 Morphological Analyzer
</SectionTitle>
      <Paragraph position="0"> Morphological analysis is an important but difficult part of the analysis since Korean is an agglutinative language with sophisticated morpheme segmentation rules and morphotactic rules. The morphological analyzer is based on Korean chart parsing (Lee, 1993). Its current precision is over 92 percent for grammatical input sentences. It aims to achieve 98 percent accuracy in two more years. It will be extended to cover special symbols, alien strings, elliptical or abbreviated words, and spelling errors to achieve higher accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="1050" end_page="1051" type="sub_section">
      <SectionTitle>
5.2 Tagger
</SectionTitle>
      <Paragraph position="0"> Because the output of morphological analysis is rather complex due to the characteristics of Korean, the use of a tagger to reduce ambiguities seems important for further processing. (Shin et al., 1995) adopts the hidden Markov model and takes into account the characteristics of Korean word phrase structures for more accurate tagging: a word phrase contains one or more morphemes, syntactic information (grammatical relations by bound morphemes), and semantic information (case roles by postpositions). The experiments revealed 98 % accuracy for the test set of 5,500 word phrases out of 55,000 training data, and 94.7 % for 5,500 untrained test data.</Paragraph>
      <Paragraph position="1"> 4 We can conceive many more types of dictionaries: for example, lexicons for syntactic and semantic analyses, and dictionaries that are to be created or extracted from existing ones upon users' or developers' needs. These will be included after the first phase of the project, following the future direction of the project.</Paragraph>
      <Paragraph position="2">  Another approach is based on the Markov random field (MRF) theory (Jung, 1996), whose Korean version will be added to IP this year.</Paragraph>
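      <Paragraph position="3"> To make the hidden Markov model approach to tagging concrete, the following is a minimal sketch of Viterbi decoding over a toy model. It is a generic illustration, not the tagger of (Shin et al., 1995): the tag set, the vocabulary, and every probability below are hypothetical, and the word phrase structure of Korean is not modeled.

```python
import math

# Hypothetical toy model: the tags, words, and probabilities below are
# illustrative assumptions, not values from the tagger described in the text.
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {
    "N": {"school": 0.6, "go": 0.1, "book": 0.3},
    "V": {"school": 0.1, "go": 0.7, "book": 0.2},
}
UNSEEN = 1e-6  # crude smoothing for words missing from the emission table


def viterbi(words):
    """Return the most probable tag sequence under the toy HMM (log space)."""
    if not words:
        return []
    # best[t][tag] = (log probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        emit = EMIT[tag].get(words[0], UNSEEN)
        best[0][tag] = (math.log(START[tag]) + math.log(emit), None)
    for t in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[t], UNSEEN))
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(TRANS[p][tag])) for p in TAGS),
                key=lambda item: item[1],
            )
            best[t][tag] = (score + emit, prev)
    # Backtrack from the best final tag.
    last = max(TAGS, key=lambda tag: best[-1][tag][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["school", "go"]))
```

      Working in log space avoids numerical underflow on long inputs; a real tagger would estimate the tables from a tagged corpus rather than fix them by hand.</Paragraph>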
    </Section>
    <Section position="3" start_page="1051" end_page="1051" type="sub_section">
      <SectionTitle>
5.3 Tree Tagger
</SectionTitle>
      <Paragraph position="0"> (Kim, 1995) is a prototype using dependency grammar and adopting statistical methods for ranking the parse trees to get k-best parsing results. Its current accuracy is about 80 % for the trained data. While this is a working prototype, we need a tree tagger with better performance, so another tree tagger using a partial parsing method (Abney, 1991) is on the breadboard.</Paragraph>
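      <Paragraph position="1"> The idea of ranking candidate parse trees statistically and keeping the k best can be sketched as follows. This is only an illustration under stated assumptions: each candidate parse is represented as a list of hypothetical (head, dependent, probability) arcs, and a parse is scored by the sum of the log probabilities of its arcs. The scoring actually used in (Kim, 1995) is not given in the text.

```python
import heapq
import math


def parse_log_score(arcs):
    """Sum of the log probabilities of the dependency arcs in one parse."""
    return sum(math.log(prob) for _head, _dependent, prob in arcs)


def k_best(parses, k):
    """Return the k highest-scoring candidate parses, best first."""
    return heapq.nlargest(k, parses, key=parse_log_score)


if __name__ == "__main__":
    # Hypothetical candidate parses for one sentence.
    candidates = [
        [("go", "school", 0.6), ("go", "I", 0.9)],
        [("school", "go", 0.2), ("go", "I", 0.9)],
        [("go", "school", 0.6), ("school", "I", 0.1)],
    ]
    for parse in k_best(candidates, 2):
        print(round(parse_log_score(parse), 3), parse)
```
      </Paragraph>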
    </Section>
    <Section position="4" start_page="1051" end_page="1051" type="sub_section">
      <SectionTitle>
5.4 Korean/English Alignment System
</SectionTitle>
      <Paragraph position="0"> An alignment system gathers correspondences between surface representations of both languages. (Shin, 1996) experimented with the expectation-maximization algorithm, obtaining 68.7 % accuracy at phrase level, and this will be incorporated into the version 2 platform.</Paragraph>
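      <Paragraph position="1"> As an illustration of how expectation-maximization can learn correspondences from parallel text, the following sketch estimates word translation probabilities in the style of IBM Model 1. This is a stand-in technique chosen for brevity, not the phrase-level method of (Shin, 1996); the romanized Korean tokens and the three-pair corpus are hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target | source) with EM over tokenised sentence pairs."""
    targets = {t for _src, tgt in pairs for t in tgt}
    # Uniform initialisation, restricted to pairs that co-occur in the corpus.
    prob = {}
    for src, tgt in pairs:
        for s in src:
            for t in tgt:
                prob[(t, s)] = 1.0 / len(targets)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source token.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return prob


def align(src, tgt, prob):
    """Link each target token to its most probable source token."""
    return [(t, max(src, key=lambda s: prob.get((t, s), 0.0))) for t in tgt]


if __name__ == "__main__":
    # Hypothetical toy corpus: romanized Korean tokens paired with English.
    corpus = [
        (["jip"], ["house"]),
        (["keun", "jip"], ["big", "house"]),
        (["chaek"], ["book"]),
    ]
    table = train_model1(corpus)
    print(align(["keun", "jip"], ["big", "house"], table))
```

      The one-word pair anchors the first correspondence, and the EM iterations then shift probability mass so that the remaining tokens of the two-word pair are linked to each other.</Paragraph>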
    </Section>
    <Section position="5" start_page="1051" end_page="1051" type="sub_section">
      <SectionTitle>
5.5 KWIC Manager
</SectionTitle>
      <Paragraph position="0"> The keyword-in-context (KWIC) manager deals with word usage in the text corpus. Its functions include indexing and searching word phrases, morphemes or unigrams, applying logic operations (AND, OR, NOT) to them, and sorting the results.</Paragraph>
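      <Paragraph position="1"> The functions listed for the KWIC manager (indexing, searching, logic operations, and sorting) can be sketched with a small inverted index. This is a minimal illustration, assuming whitespace-tokenized sentences; the class and method names are hypothetical and do not describe the actual tool.

```python
from collections import defaultdict


class KwicIndex:
    """Toy keyword-in-context index over whitespace-tokenised sentences."""

    def __init__(self, sentences):
        self.sentences = [s.split() for s in sentences]
        self.postings = defaultdict(set)  # token to ids of sentences containing it
        for sid, tokens in enumerate(self.sentences):
            for token in tokens:
                self.postings[token].add(sid)

    def ids(self, token):
        """Ids of the sentences that contain the token."""
        return set(self.postings.get(token, set()))

    def all_of(self, *tokens):
        """AND: sentences containing every token."""
        sets = [self.ids(t) for t in tokens]
        return set.intersection(*sets) if sets else set()

    def any_of(self, *tokens):
        """OR: sentences containing at least one token."""
        return set().union(*(self.ids(t) for t in tokens))

    def without(self, ids, token):
        """NOT: drop the sentences that contain the token."""
        return set(ids) - self.ids(token)

    def kwic(self, token, width=2):
        """Sorted (left context, keyword, right context) lines for a token."""
        lines = []
        for sid in self.ids(token):
            tokens = self.sentences[sid]
            for i, tok in enumerate(tokens):
                if tok == token:
                    left = " ".join(tokens[max(0, i - width):i])
                    right = " ".join(tokens[i + 1:i + 1 + width])
                    lines.append((left, token, right))
        # Sort by right context first, then by left context.
        return sorted(lines, key=lambda line: (line[2], line[0]))


if __name__ == "__main__":
    index = KwicIndex(["the cat sat", "the dog sat", "a cat ran"])
    print(index.all_of("the", "sat"))
    print(index.kwic("cat", width=1))
```

      Sorting by right context groups similar usages of a keyword together, which is the usual way concordance lines are inspected.</Paragraph>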
    </Section>
    <Section position="6" start_page="1051" end_page="1051" type="sub_section">
      <SectionTitle>
5.6 Text/Dictionary Management
System
</SectionTitle>
      <Paragraph position="0"> TDMS' goals are twofold: to provide customizable information extraction/indexing/search tools and managerial functions for the text data base; and to provide an environment for dictionary development and management as well as for converting or merging existing dictionaries into the intended one according to the user's specification.</Paragraph>
      <Paragraph position="1"> Because of the large size of each text to be stored and the many keywords to be indexed and searched for each text, it requires special storage and management mechanisms. This is also the case for dictionary management. For extensibility and adaptability, we have devised a standard dictionary markup language based on SGML. Templates (dictionary features, text descriptors, and relations among those) and specifications for the text/dictionary editor and format translator have also been designed, and low-level design is being undertaken. This work is being coded on PC Windows and will output the first draft version this year.</Paragraph>
    </Section>
  </Section>
</Paper>