File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-3010_intro.xml
Size: 1,736 bytes
Last Modified: 2025-10-06 14:02:30
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3010"> <Title>Part-of-Speech Tagging Considering Surface Form for an Agglutinative Language</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Part-of-speech (POS) tagging is the task of assigning a proper POS tag to each linguistic unit, such as a word, in a given sentence. In English POS tagging, the word is used as the linguistic unit. However, the number of possible words in agglutinative languages such as Korean is almost infinite because words can be freely formed by gluing morphemes together.</Paragraph> <Paragraph position="1"> Therefore, morpheme-unit tagging is preferred and is more suitable than word-unit tagging for such languages. Figure 1 shows an example of the morpheme structure of a sentence, where the bold lines indicate the most likely morpheme-POS sequence. A solid line represents a transition between two morphemes across a word boundary, and a dotted line represents a transition between two morphemes within a word.</Paragraph> <Paragraph position="2"> Previous probabilistic POS models for agglutinative languages have considered only the lexical forms of morphemes, not the surface forms of words.</Paragraph> <Paragraph position="3"> This leads to inaccurate probability estimates. The proposed model is based on the observation that when words (surface forms) share the same lexical forms, their probabilities of occurrence differ from one another. It is also designed to consider the lexical form of a word. Through experiments, we show that the proposed model outperforms the bigram hidden Markov model (HMM)-based tagging model.</Paragraph> </Section> </Paper>