<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1045">
  <Title>EXAMPLE-BASED CORRECTION OF WORD SEGMENTATION AND PART OF SPEECH LABELLING</Title>
  <Section position="4" start_page="227" end_page="228" type="metho">
    <SectionTitle>
2. ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> The architecture in Figure 1 was chosen to minimize labor and to maximize use of existing software. It employs JUMAN first to provide initial word segmentation of the text, an annotation-based algorithm second to correct both segmentation errors and part of speech errors in JUMAN output, and POST third both to select among ambiguous alternative segmentations/partof-speech assignments and also to predict the part of speech of unknown words.</Paragraph>
    <Paragraph position="1">  Let us briefly review each component. JUMAN, available from Kyoto University makes segmentation decisions and part of speech assignments to Japanese text. To do this, it employs a lexicon of roughly 40,000 words, including their parts of speech. Where alternative segmentations are possible, the connectivity matrix eliminates some possibilities, since it states what parts of speech may follow a given part of speech. Where the connectivity matrix does not dictate a single segmentation and part of speech, generally longer words are preferred over shorter segmentations.</Paragraph>
    <Paragraph position="2"> An example JUMAN output is provided in Figure 2.</Paragraph>
    <Paragraph position="3"> The Japanese segment is given first, followed by a slash  and the part of speech. JUMAN employs approximately 45 parts of speech. 1 FIGURE 2a: A Short Example Sentence ~) ~&amp;quot;~NB deg /KT FIGURE 2b: JUMAN output for example 2a above  The correction algorithm (AMED) is trained with two parallel annotations of the same text. One of the annotations is JUMAN's output. The second is manually annotated corresponding to correct segmentation and correct part-of-speech assignments for each word. During training, AMED aligns the parallel annotations, identifies deviations as &amp;quot;corrections&amp;quot;, and automatically generalizes these into correction rules. An example of automatic alignment appears in Figure 3.</Paragraph>
    <Paragraph position="4"> AMED performs the following functions: * Corrects some segmentation errors made by JUMAN.</Paragraph>
    <Paragraph position="5"> Corrects some part-of-speech assignment errors made by JUMAN. Some of these &amp;quot;corrections&amp;quot; actually introduce ambiguity which POST later resolves.</Paragraph>
    <Paragraph position="6"> Transforms the tag set produced by JUMAN into the tag set required by the grammar.</Paragraph>
    <Paragraph position="7"> Note that all of these functions are the result of the learning algorithm, no rules for correction nor for translating JUMAN parts of speech into those for the grammar were written by hand.</Paragraph>
    <Paragraph position="8"> The third component is POST, which assigns parts of speech stochastically via a Hidden Markov model, has been described elsewhere \[Meteer, et al., 1991\]. POST performs two vital functions in the case of our Japanese processing:  POST decides among ambiguous part-of-speech labellings and segmentations, particularly in those cases where AMED's training data includes cases where JUMAN is prone to error.</Paragraph>
    <Paragraph position="9"> POST predicts the most likely part of speech for an unknown word segment in context.</Paragraph>
  </Section>
  <Section position="5" start_page="228" end_page="228" type="metho">
    <SectionTitle>
3. HOW THE ARCHITECTURE
ADDRESSES THE ISSUES
</SectionTitle>
    <Paragraph position="0"> In principle, a Hidden Markov Model implementation, such as POST, can make both part-of-speech decisions and segment text quite reliably. Therefore, why not just use POST; why use three components instead? The clear reason was to save human effort. We did not have access to segmented and labelled Japanese text.</Paragraph>
    <Paragraph position="1"> Labelling tens of thousands (or even hundreds of thousands of words of text) for supervised training would have taken more effort and more time in a project with tight schedules and limited resources. JUMAN existed and functioned above 90% accuracy in segmentation.</Paragraph>
    <Paragraph position="2"> A secondary reason was the opportunity to investigate an algorithm that learned correction rules from examples. A third reason was that we did not have an extensive lexicon using the parts of speech required by the grammar.</Paragraph>
    <Paragraph position="3"> The architecture addressed the four issues raised in the introduction as follows:  1) AMED learned rules to transform JUMAN's parts of speech to those required by the grammar.</Paragraph>
    <Paragraph position="4"> 2) Accuracy was improved both by AMED's correction rules and by POST's Hidden Markov Model.</Paragraph>
    <Paragraph position="5"> 3) POST hypothesizes the most likely part of  speech in context for unknown words, words not in the JUMAN lexicon.</Paragraph>
    <Paragraph position="6"> 4) The sample inspection method in AMED estimates probabilities for low frequency phenomena.</Paragraph>
  </Section>
  <Section position="6" start_page="228" end_page="229" type="metho">
    <SectionTitle>
4. THE CORRECTION MODEL
</SectionTitle>
    <Paragraph position="0"> The only training data for our algorithm is manually annotated word segmentation and part of speech labels.</Paragraph>
    <Paragraph position="1"> Examples of corrections of JUMAN's output are extracted by a procedure that automatically aligns the annotated data with JUMAN's output and collects pairs of differences between sequences of pairs of word segment and part of speech. Each pair of differing strings represents a correction rule; the procedure also generalizes the examples to create more broadly applicable correction rules.</Paragraph>
    <Paragraph position="2">  We estimate probabilities for the correction rules via the sample inspection method. (see the Appendix.) Here, significance level is a parameter, from a low of 0.1 for ambitious correction through a high of 0.9 for conservative correction. The setting gives us some trade-off between accuracy and the degree of ambiguity in the results. One selects an appropriate value by empirically testing performance over a range of parameter settings. Correction rules are ordered and applied based on probability estimates.</Paragraph>
    <Paragraph position="3"> When a rule matches, l) AMED corrects JUMAN's output if the probability estimate exceeds a user-specified threshold, 2) AMED introduces an alternative if the probability falls below that threshold but exceeds a second user-supplied threshold, or 3) AMED makes no change if the probability estimate falls below both thresholds.</Paragraph>
    <Paragraph position="4">  As a result, a chart representing word segmentation and part of speech possibilities is passed to POST, which was easily modified to handle a chart as input, since the underlying Viterbi algorithm applies equally well to a chart. POST then selects the most likely combination of word segmentation and part of speech labels according to a bi-gram probability model.</Paragraph>
    <Paragraph position="6"/>
    <Paragraph position="8"/>
  </Section>
  <Section position="7" start_page="229" end_page="230" type="metho">
    <SectionTitle>
5. EXPERIENCE
</SectionTitle>
    <Paragraph position="0"> The motivation for this study was the need to port our PLUM data extraction system \[Weischedel, et al., 1992\] to process Japanese text. The architecture was successful enough that it is part of (the Japanese version of) PLUM now, and has been used in Government-sponsored evaluations of data extraction systems in two domains: extracting data pertinent to joint ventures and extracting data pertinent to advances in microelectronics fabrication technology. It has therefore been run over corpora of over 300,000 words.</Paragraph>
    <Paragraph position="1"> There are two ways we can illustrate the effect of this architecture: a small quanitative experiment and examples of generalizations made by AMED.</Paragraph>
    <Section position="1" start_page="229" end_page="229" type="sub_section">
      <SectionTitle>
5.1 A Small Experiment
</SectionTitle>
      <Paragraph position="0"> We ran a small experiment to measure the effect of the architecture (JUMAN + AMED + POST), contrasted with JUMAN alone. Japanese linguistics students corrected JUMAN's output; the annotation rate of an experienced annotator is roughly 750 words per hour, using the TREEBANK annotation tools (which we had ported to Japanese). In the first experiment, we used 14,000 words of training data and 1,400 words of test data. In a second experiment, we used 81,993 words of training data and a test set of 4,819 words.</Paragraph>
      <Paragraph position="1"> Remarkably the results for the two cases were almost identical in error rate. In the smaller test (of 1,400 words), the error rate on part-of-speech labelling (given correct segmentation) was 3.6%, compared to 8.5%; word segmentation error was reduced from 9.4% to 8.3% using the algorithm. In the larger test (of 4,819 words), the error rate on part-of-speech labelling (given correct segmentation) was 3.4%, compared to 8.2%; word segmentation error was reduced from 9.4% to 8.3% using the algorithm.</Paragraph>
      <Paragraph position="2"> Therefore, using the AMED correction algorithm plus POST's hidden Markov model reduced the error rate in part of speech by more than a factor of two. Reduction in word segmentation was more modest, a 12% improvement.</Paragraph>
      <Paragraph position="3"> Error rate in part-of-speech labelling was therefore reduced to roughly the error rate in English, one of our original goals.</Paragraph>
      <Paragraph position="4"> Both segmentation error and part of speech error could be reduced further by increasing the size of JUMAN's lexicon and/or by incorporating additional generalization patterns in AMED's learning alogrithm. However, in terms of improving PLUM's overall performance in extracting data from Japanese text, reducing word segmentation error or part-of-speech error are not the highest priority.</Paragraph>
    </Section>
    <Section position="2" start_page="229" end_page="230" type="sub_section">
      <SectionTitle>
5.2 Examples of Rules Learned
</SectionTitle>
      <Paragraph position="0"> One restriction we imposed on generalizations considered by the algorithm is that rules must be based on the first or last morpheme of the pattern. This is based on the observation in skimming the result of alignment that the first or last morpheme is quite informative. Rules which depend critically on a central element in the difference between aligned JUMAN output and supervised training were not considered. A second limitation that we imposed on the algorithm was that the fight hand side of any correction rule could only contain one element, instead of the general case. Three kinds of correction rules can be inferred.</Paragraph>
      <Paragraph position="1"> * A specific sequence of parts of speech in JUMAN's output can be replaced by a single morpheme with one part of speech.</Paragraph>
      <Paragraph position="2"> * A specific sequence of parts of speech plus a specific word at the left edge can be replaced by a single morpheme with one part of speech.</Paragraph>
      <Paragraph position="3">  A specific sequence of parts of speech plus a specific word at the right edge can be replaced by a single morpheme with one part of speech.</Paragraph>
      <Paragraph position="4"> The critical statistic in selecting among the interpretations is the fraction of times a candidate rule correctly applies in the training data versus the number of times it applies in the training. In spite of these selfimposed limitations in this initial implementation, the rules that are learned improved both segmentation and labelling by part of speech, as detailed in Section 5.1. Here we illustrate some useful generalizations made by the algorithm and used in our Japanese version of the PLUM data extraction system.</Paragraph>
      <Paragraph position="5"> In example (1) below, the hyptohesized rule essentially recognizes proper names arising from an unknown, a punctuation mark, and a proper noun; the rule hypothesizes that the three together are a proper noun. This pattern only arises in the case of person names (an initial, a period, and a last name) in the training corpus.</Paragraph>
      <Paragraph position="7"> Example (2) is a case where an ambiguous word Cnerai&amp;quot;, meaning a&amp;quot;aim&amp;quot; or &amp;quot;purpose&amp;quot;) is rarely used as a verb, but JUMAN's symbolic rules are predicting it as a verb.</Paragraph>
      <Paragraph position="8"> The rule corrects the rare tag to the more frequent one,  common noun.</Paragraph>
      <Paragraph position="9"> 2. ~.\[t ~,~NB ===&gt; CN ~t. ~ a/VB ~'jl~. ~ VCN Example (3) represents the equivalent of learning a lexical entry from annotation; if JUMAN had had it in its  lexicon, no correction of segmentation (and part of speech) would have been necessary. There are many similar, multi-character, idiomatic particles in Japanese. Parallel cases arise in English, such as &amp;quot;in spite off and &amp;quot;in regard to&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="230" end_page="230" type="metho">
    <SectionTitle>
3. ~/NCM */PT */CN */PT ===&gt; PT
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
class="xml-element"></Paper>