<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2165">
  <Title>KCAT : A Korean Corpus Annotating Tool Minimizing Human Intervention</Title>
  <Section position="3" start_page="1096" end_page="1098" type="metho">
    <SectionTitle>
3. Proposed POS Tagging Tool: KCAT
</SectionTitle>
    <Paragraph position="0"> The proposed POG tagging tool is used to combine the manual tagging method and the automatic tagging method. They are integrated to increase the accuracy o\[&amp;quot; the automatic tagging method and to minimize the amount of tile human labor of thc manual tagging method.</Paragraph>
    <Paragraph position="1"> Figure 1 shows the overall architecture of the proposed tagging tool :KCAT.</Paragraph>
    <Paragraph position="3"> As shown in figm'e 1, KCAT consists of three modules: the pre-processing module, the automatic tagging module, and the post-processing module. In the prcoprocessing module, the disambiguation rules are acquired I%m human experts. The candidate words are Ihe target words whose disambiguation rules are acquired. The candidate words can be unknown words and also very frequent words. In addition, the words with problematic ambiguity for tlle automatic tagger can become candidates.</Paragraph>
    <Paragraph position="4"> l)lsamblguation rules are acquired with minimal human labor using tile tool t:n'oposed in (Lee, 1996). In the automatic tagging naodule, the disambiguation rules resolve the ambiguity of {,'very word to which they can be applied.</Paragraph>
    <Paragraph position="5"> I lowever, tile rules are certainly not sufficient to resolve all the ambiguity of the whole words in file corpus. The proper tags are assigned to the remaining ambiguous words by a stochastic &lt; t~&amp;quot; c, hLllllan lagger. After the automatic t,t~m~, a expert corrects tile onors o\[ the stochastic ta,me, The system presents the expert with the results of the stochastic tagger. If the result is incorrect, tile hulllan expel1 corrects the error and generates a disambiguation rule ~br the word.</Paragraph>
    <Paragraph position="6"> The rule is also saved in the role base in order to bc used later.</Paragraph>
    <Paragraph position="7"> 3. I. l.exical Rules for Disambiguation There are many ambiguous words that are extremely difficult to resolve alnbiguities by using a stochastic tagger. Due to the problematic words, manual tagging and manual correction must be done to build a correct coqms. Such human intervention may be repeated again and again to tag or to correct tile same word in the same context.</Paragraph>
    <Paragraph position="8"> For example, a human expert should assign  invented a flying plane) In the above sentences, human experts can resolve the word, 'Na-Nemf with only the previous and ttle next lexical information: 'fla-Neul-Eul' and 'Pi-tlaeng- Ki-Reul'. In other words, tile human expert has to waste time on tagging the same word in tile same context repeatedly. This inefficiency can also be happened in the manual correction of the ntis-tagged words. So, if the human expert can make a rule with his disambiguation knowledge and use it for tile same words in tile same context, such inefficiency can be minimized. We define the disambiguation rule as a lexical rule. Its template is as follows.</Paragraph>
    <Paragraph position="9"> \[P:N\] \[Current Word\] \[Context\] = \[Tagging P, esuh\] Context * Previous wordsdegp * Next Wordsdeg,, Ill tile above template, p and n mean tile previous and the next context size respectively. For the present, p and n are limited to 3. '*'  represents the separating mark between the previous and next context. For example, tile rule</Paragraph>
    <Paragraph position="11"> the tag 'Nal(flying)/Verb + Neun/Ending' should be assigned to the word 'Na-Neun' when the previous word and the next word is 'Ha-Neul-Eul' and 'Pi-Haeng-Ki-Reul'.</Paragraph>
    <Paragraph position="12"> Although these lexical rules cannot always correctly disambiguate all Korean words, they are enough to cover many problematic ambignous words. We can gain some advantages of using the lexical rule. First, it is very accurate because it refers to the very specific lexical information. Second, the possibility of rule conflict is very little even though the number of the rules is increased. Third, it can resolve problematic ambiguity that cannot be resolved without semantic inf'onnation(Lim, 1996).</Paragraph>
    <Section position="1" start_page="1097" end_page="1097" type="sub_section">
      <SectionTitle>
3.2. Lexical Rule Acquisition
</SectionTitle>
      <Paragraph position="0"> Lexical rules are acquired for the unknown words and the problematic words that are likely to be tagged erroneously by an automatic tagger.</Paragraph>
      <Paragraph position="1"> Lexical rule acquisition is perlbrmed by following steps: 1. The system builds a candidate list of words li)r which the lexical rules would be acquired. The candidate list is the collection of all examples of unknown words and problematic words for an automatic tagger.</Paragraph>
      <Paragraph position="2">  2. A human expert selects a word from the list and makes a lexical rule for the word. 3. The system applies tile lexical rule to all examples of the selected word with same context and also saves the lexical rule in the rule base.</Paragraph>
      <Paragraph position="3"> 4. P, epeat tile steps 2 and 3 until all examples of the candidate words can be tagged by the acquired lexical rules.</Paragraph>
      <Paragraph position="4"> 3.3. Automatic Ta,,,in,,  In the automatic ta,,~dn-oo ~ phase, words are disambiguated by using the lexical rules and a stochastic tagger. To armotate a word in a raw corpus, the rule-based tagger first searches the lexical rule base to find a lexical rule that can be nlatched with tile given context. If a matching rnle is found, the system assigns the result of the rule to the word. According to the corresponding rule, a proper tag is assigned to a word. With tile lexical rules~ a very precise tag can be assigned to a word. However, because the lexical rules do not resolve all the ambiguity of the whole corpus, we must make use of a stochastic tagger. We employ an HMM--based POS tagger for this purpose(Kim,1998). The stochastic tagger assigns the proper tags to the ambiguous words afier the rule application.</Paragraph>
      <Paragraph position="5"> Alter disambiguating the raw corpus using the lexical rules and the atttomatic tagger, we arrive at the frilly disambiguated result. But the word tagged by the stochastic tagger may have a chance to be mis-tagged. Therefore, the post-processing for error correction is required for the words tagged by the stochastic tagger.</Paragraph>
    </Section>
    <Section position="2" start_page="1097" end_page="1098" type="sub_section">
      <SectionTitle>
3.4. Error Correction
</SectionTitle>
      <Paragraph position="0"> The human expert carries out the error correction task for the words tagged by a stochastic tagger. This error correction also requires tile repeatecl human labor as in the manual tagging. We employ the similar way of the rule acquisition to reduce the human labor needed for manual error cmTection. The results of the automatic tagger are marked to be distinguished from tile results of the rule-based tagger. The human expert checks the marked words only. If an error is found, the ht/man expert assigns a correct tag to the word. When tile expert corrects the erroneous word, tile system automatically generates a lexicat rule and stores it in tile rnle base. File newly acquired rule is autoinatically applied to the rest of tile corpus. Thus, the expert does not need to correct the repeated errors.</Paragraph>
      <Paragraph position="1">  Based on the proposed method~ we have imrdemented, a corpus--annotating tool for Koreart which is named as KCAT(Korean Corpus Annotating 'Fool). The process of building large corpora with KCAT is as lbllows: 1. The lexical roles in the rule base are applied to a raw corpu::,. If the rule base i!; empty, nothing will be done.</Paragraph>
      <Paragraph position="2">  the stochastic tagger, and lexical rules for those errors are also stored in the role--base.</Paragraph>
      <Paragraph position="3"> 6. For other corpus, repeat the steps 1 through 5.</Paragraph>
      <Paragraph position="4"> Figure 2 shows a screenshot of KCAT. In this figure, &amp;quot;A' window represents the list of raw corpus arm a &amp;quot;B' window contains the contcnt of the selected raw corpus in the window A. The tagging result is displayed in the window 'C'. Words beginning with &amp;quot;&gt;' are tagged by a stocha,,;tic la-&lt;,e, and the other words are ta~Eed by lexical rules.</Paragraph>
      <Paragraph position="5"> We can -et the more lexical rules as the ta,,,,itw process is prom-esscd. Therefore, we can expect that the aecunu-y and the reduction rate C of human htbor are increased a~ long as the tagging process is corltilmed.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1098" end_page="1099" type="metho">
    <SectionTitle>
5. Experimental Results
</SectionTitle>
    <Paragraph position="0"> In order to estimate tim experimental results of our system, we collected the highly ambiguous words and frequently occurring words in our test corpus with 50,004 words.</Paragraph>
    <Paragraph position="1"> \]able I shows reductions in human intervention required to armotate the raw coums when we use lexical rules lbr the highly ambiguous words and the frequently occurring words respectively. The second colurnn shows that we examined the 4,081 OCCLirrences of 2,088 words with tag choices above 7 and produced 4,081 lexical rules covering 4,832 occurrences of the corpl_lS. In this case, the reduction rate of human intervention is 1.5%. ~ The third column shows that we exalnined thc 6,845 occurrences of 511 words with ficqucncy above 10 and produced 6,845 lexical rules covering 15,4 l 8 occurrences of the corpus. In tiffs case, the reduction rate of human intervention is 17%. 2 The last row in the table shows how intbrnmtive the rules are. We measured it by the inq-~iovement rate of stochastic tagging ;_!.l'l.el- the rules arc applied. From these experimental result.~;, wc can judge that rule-acquisition from flcquelatly occurring words is preferable.</Paragraph>
    <Paragraph position="3"> Table 2 shows the results of our experiments on tile applicability of lexical rules. We measure it by the improyement rate of stochastic tagging alter the rules acquired from other corpus are applied.</Paragraph>
    <Paragraph position="4"> The third row shows that we annotate a training corpus with 10,032 words and produce 631 lexieal rules, which can be applied to another test corpus to reduce tile number of the stochastic ta-,,in,, errors frorn 697 to 623. 3 The ~brth and fifth row show that as the number of lexical rules is increased, the number of the errors of the tagger is decreased on the test corpus.</Paragraph>
    <Paragraph position="5"> These experilnental results demonstrate tile promise of gradual decrement of human intervention and improvement of tagging accuracy in annotating corpora.</Paragraph>
  </Section>
  <Section position="5" start_page="1099" end_page="1099" type="metho">
    <SectionTitle>
6. Conclusion
</SectionTitle>
    <Paragraph position="0"> The main goal of our work is to dcvelop an efficiclat tool which supports to build a very</Paragraph>
  </Section>
class="xml-element"></Paper>