<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2016">
<Title>Rapid Parser Development: A Machine Learning Approach for Korean</Title>
<Section position="3" start_page="0" end_page="118" type="intro">
<SectionTitle> 2 Korean </SectionTitle>
<Paragraph position="0"> Like Japanese, Korean is a head-final agglutinative language. It is written in a phonetic alphabet called hangul, in which each two-byte character represents one syllable. While our parser operates on the original Korean hangul, this paper presents examples in a romanized transcription. In sentence (1), for example, the verb is preceded by a number of so-called eojeols (equivalent to bunsetsus in Japanese) like &quot;chaeg-eul&quot;, which are typically composed of a content part (&quot;chaeg&quot; = book) and a postposition, which often corresponds to a preposition in English, but is also used as a marker of topic, subject or object (&quot;eul&quot;).</Paragraph>
<Paragraph position="2"> Na-neun eo-je geu chaeg-eul sass-da.</Paragraph>
<Paragraph position="3"> I-TOPIC yesterday this book-OBJ bought. (1) I bought this book yesterday.</Paragraph>
<Paragraph position="4"> Our parser produces a tree describing the structure of a given sentence, including syntactic and semantic roles, as well as additional information such as tense. For example, the parse tree for sentence (1) is shown below:
[1] na-neun eo-je geu chaeg-eul sass-da. [S]
    (SUBJ) [2] na-neun [NP]
        (HEAD) [3] na [REG-NOUN]
        (PARTICLE) [4] neun [DUPLICATE-PRT]
    (TIME) [5] eo-je [REG-ADVERB]
        (HEAD) [6] eo-je [REG-ADVERB]
    (OBJ) [7] geu chaeg-eul [NP]
        (MOD) [8] geu [DEMONSTR-ADNOMINAL]
            (HEAD) [9] geu [DEMONSTR-ADNOMINAL]
        (HEAD) [10] chaeg-eul [NP]
            (HEAD) [11] chaeg [REG-NOUN]
            (PARTICLE) [12] eul [OBJ-CASE-PRT]
    (HEAD) [13] sass-da. [VERB; PAST-TENSE]
        (HEAD) [14] sa [VERB-STEM]
        (SUFFIX) [15] eoss [INTERMED-SUF-VERB]
        (SUFFIX) [16] da [CONNECTIVE-SUF-VERB]
    (DUMMY) [17] . [PERIOD]
For preprocessing, we use a segmenter and morphological analyzer, KMA, and a tagger, KTAG, both provided by the research group of Prof. Rim of Korea University. KMA, which comes with a built-in Korean lexicon, segments Korean text into eojeols and provides a set of possible sub-segmentations and morphological analyses. KTAG then tries to select the most likely such interpretation. Our parser is initialized with the result of KMA, preserving all interpretations, but marking KTAG's choice as the top alternative.</Paragraph>
</Section>
</Paper>