<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1018"> <Title>POS Disambiguation and Unknown Word Guessing with Decision Trees</Title> <Section position="3" start_page="134" end_page="134" type="metho"> <SectionTitle> 2 Tagger Architecture </SectionTitle> <Paragraph position="0"> Figure 1 illustrates the functional components of the tagger and the order of processing: Raw text passes through the Tokenizer, where it is converted to a stream of tokens. Non-word tokens (e.g., punctuation marks, numbers, dates, etc.) are resolved by the Tokenizer and receive a tag corresponding to their category. Word tokens are looked up in the Lexicon, and those found receive one or more tags. Words with more than one tag and those not found in the Lexicon pass through the Disambiguator/Guesser, where the contextually appropriate tag is decided/guessed. The Disambiguator/Guesser is a 'forest' of decision trees, one tree for each ambiguity scheme present in M. Greek and one tree for unknown word guessing. When a word with two or more tags appears, its ambiguity scheme is identified. Then, the corresponding decision tree is selected and traversed according to the values of morphosyntactic features extracted from contextual tags. This traversal returns the contextually appropriate POS. The ambiguity is resolved by eliminating the tag(s) whose POS differs from the one returned by the decision tree.</Paragraph> <Paragraph position="1"> The POS of an unknown word is guessed by traversing the decision tree for unknown words, which examines contextual features along with the word ending and capitalization and returns an open-class POS.</Paragraph> </Section> <Section position="4" start_page="134" end_page="2638" type="metho"> <SectionTitle> 3 Training Sets </SectionTitle> <Paragraph position="0"> For the study and resolution of lexical ambiguity in M. Greek, we set up a corpus of 137,765 tokens (7,624 sentences), collecting sentences from student writings, literature, newspapers, and technical, financial and sports magazines.</Paragraph> <Paragraph position="1"> We made sure to adequately cover all POS ambiguity schemes present in M. Greek, without showing preference to any scheme, so as to have an objective view of the problem. Subsequently, we tokenized the corpus, inserted it into a database, and let the lexicon assign a morphosyntactic tag to each word-token. We did not use any specific tag-set; instead, we let the lexicon assign to each known word all available morphosyntactic attributes. Table 1 shows a sample sentence after this initial tagging (symbolic names appearing in the tags are explained in Appendix A).</Paragraph> <Paragraph position="2"> To words with POS ambiguity (e.g., tokens #2 and #3 in Table 1) we manually assigned their contextually appropriate POS. To unknown words (e.g., token #5 in Table 1), which by default received a disjunction of open-class POS labels, we manually assigned their real POS and explicitly declared their inflectional ending.</Paragraph> <Paragraph position="3"> In a subsequent phase, for all words belonging to a specific ambiguity scheme, and for all unknown words, we collected from the tagged corpus their automatically and manually assigned tags along with the automatically assigned tags of their neighboring tokens. In this way, we created a training set for each ambiguity scheme and a training set for unknown words. Table 2 shows a 10-example fragment from the training set for the ambiguity scheme Verb-Noun.
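As an illustration of this collection step, the following Python fragment is a minimal sketch, not the tool actually used in this work, of how per-scheme training sets might be assembled from such a tagged corpus; the token representation (a list of automatically assigned candidate tags plus a hand-assigned tag per token) and all names are hypothetical.

# Minimal sketch of building one training set per ambiguity scheme.
# Assumed (hypothetical) token format: {'tags': [...], 'manual_tag': '...'},
# where each automatically assigned tag starts with its POS label, e.g. 'Vrb/...'.

def pos_of(tag):
    """Extract the POS label from a full morphosyntactic tag (assumed format)."""
    return tag.split('/')[0]

def build_training_sets(sentences, context=2):
    """Collect, for every POS-ambiguous token, its own tags, the tags of its
    neighbours, and its manually assigned tag, grouped by ambiguity scheme."""
    training_sets = {}
    for sentence in sentences:
        for i, token in enumerate(sentence):
            pos_labels = sorted({pos_of(t) for t in token['tags']})
            if len(pos_labels) == 1:
                continue                          # unambiguous token: not a training example
            scheme = '-'.join(pos_labels)         # e.g. 'Noun-Vrb'
            neighbours = [sentence[j]['tags'] if 0 <= j < len(sentence) else None
                          for j in range(i - context, i + context + 1) if j != i]
            example = {'context_tags': neighbours,         # Tag_{i-2} ... Tag_{i+2}
                       'tag': token['tags'],               # Tag_i (ambiguous)
                       'manual_tag': token['manual_tag']}  # Manual_Tag_i
            training_sets.setdefault(scheme, []).append(example)
    return training_sets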
For reasons of space, Table 2 shows the tags of only the previous (column Tag_{i-1}) and next (column Tag_{i+1}) tokens in the neighborhood of an ambiguous word, whereas more contextual tags actually comprise a training example. A training example also includes the manually assigned tag (column Manual_Tag_i) along with the automatically assigned tag 2 (column Tag_i) of the ambiguous word. One can notice that some contextual tags are missing (e.g., Tag_{i-1} of Example 7; the ambiguous word is the first in the sentence), and that some contextual tags may exhibit POS ambiguity (e.g., Tag_{i+1} of Example 1), which implies that the learner must be able to learn from incomplete/ambiguous examples, since this is the case in real texts.</Paragraph> <Paragraph position="4"> If we consider that a tag encodes 1 to 5 morphosyntactic features, each feature taking one value or a disjunction of 2 to 11 values, then the total number of different tags amounts to several hundred 3. This fact prohibits feeding the training algorithms with patterns of the form (Tag_{i-2}, Tag_{i-1}, Tag_i, Tag_{i+1}, Manual_Tag_i), which is the case for similar systems that learn POS disambiguation (e.g., Daelemans et al., 1996). On the other hand, it would be inefficient (leading to information loss) to generate a simplified tag-set in order to reduce its size. The 'what should the training patterns look like' bottleneck was overcome by introducing a set of functions that extract from a tag the value(s) of specific features, e.g.:</Paragraph> <Paragraph position="6"> With the help of these functions, the training examples shown in Table 2 are translated into patterns that consist of a sequence of feature values extracted from the previous/current/next tags, along with the manually assigned POS label.</Paragraph> <Paragraph position="7"> Due to this transformation, two issues automatically arise: (a) A feature-extracting function may return more than one feature value (as in the Gender(...) example); consequently, the training algorithm should be capable of handling set-valued features. (b) A feature-extracting function may return no value, e.g.</Paragraph> <Paragraph position="9"> thus we added an extra value, the value None, to each feature 4.</Paragraph> <Paragraph position="10"> To summarize, the training material we prepared consists of: (a) a set of training examples for each ambiguity scheme and a set of training examples for unknown words 5, and (b) a set of features accompanying each example-set, denoting which features (extracted from the tags of the training examples) will participate in the training procedure. This configuration offers the following advantages: 1. A training set is examined only for the features that are relevant to the corresponding ambiguity scheme, thus addressing its idiosyncratic needs.</Paragraph> <Paragraph position="11"> 2. Which features are included in each feature-set depends on linguistic reasoning about the specific ambiguity scheme, thereby introducing linguistic bias into the learner.</Paragraph> <Paragraph position="12"> 3. The learning is tag-set independent, since it is based on specific features and not on entire tags.</Paragraph> <Paragraph position="13"> 4. 
The learning of a particular ambiguity scheme can be fine-tuned by including new features in, or excluding existing features from, its feature-set, without affecting the learning of the other ambiguity schemes.</Paragraph> </Section> <Section position="5" start_page="2638" end_page="2638" type="metho"> <SectionTitle> 4 Decision Trees </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2638" end_page="2638" type="sub_section"> <SectionTitle> 4.1 Tree Induction </SectionTitle> <Paragraph position="0"> In the previous section, we stated the use of linguistic reasoning for the selection of feature-sets suitable to the idiosyncratic properties of the corresponding ambiguity schemes.</Paragraph> <Paragraph position="1"> (Footnote 4: e.g., Gender = {Masculine, Feminine, Neuter, None}. Footnote 5: The training examples for unknown words, besides contextual tags, also include the capitalization feature and the suffixes of unknown words.)</Paragraph> <Paragraph position="2"> Formally speaking, let FS be the feature-set attached to a training set TS. The algorithm used to transform TS into a decision tree belongs to the TDIDT (Top Down Induction of Decision Trees) family (Quinlan, 1986). Based on the divide-and-conquer principle, it selects the best feature F_best from FS, partitions TS according to the values of F_best, and repeats the procedure for each partition after excluding F_best from FS, continuing recursively until all (or the majority of) examples in a partition belong to the same class C or no more features are left in FS.</Paragraph> <Paragraph position="3"> During each step, in order to find the feature that best predicts the class labels and use it to partition the training set, we select the feature with the highest gain ratio, an information-based quantity introduced by Quinlan (1986). The gain ratio metric is computed as follows: Assume a training set TS with patterns belonging to one of the classes C_1, C_2, ..., C_k.</Paragraph> <Paragraph position="4"> The average information needed to identify the class of a pattern in TS is: info(TS) = - SUM_{j=1..k} (|C_j| / |TS|) x log2(|C_j| / |TS|)</Paragraph> <Paragraph position="6"> Now consider that TS is partitioned into TS_1, TS_2, ..., TS_n according to the values of a feature F from FS. The average information needed to identify the class of a pattern in the partitioned training set is: info_F(TS) = SUM_{i=1..n} (|TS_i| / |TS|) x info(TS_i)</Paragraph> <Paragraph position="8"> The information gain, gain(F) = info(TS) - info_F(TS), measures the information relevant to classification that is gained by partitioning TS in accordance with the feature F. Gain ratio is a normalized version of information gain: gain ratio(F) = gain(F) / split info(F). Split info is a necessary normalizing factor, since gain favors features with many values; it represents the potential information generated by dividing TS into n subsets: split info(F) = - SUM_{i=1..n} (|TS_i| / |TS|) x log2(|TS_i| / |TS|)</Paragraph> <Paragraph position="10"> Considering the formula that computes the gain ratio, we notice that the best feature is the one that presents the minimum entropy in predicting the class labels of the training set, provided the information of the feature is not split over its values.</Paragraph> <Paragraph position="11"> The recursive algorithm for decision tree induction is shown in Figure 2. Its parameters are: a node N, a training set TS and a feature set FS. Each node is constructed in a top-down, left-to-right fashion and contains a default class label C (which characterizes the path constructed so far); if it is a non-terminal node, it also contains a feature F from FS according to which further branching takes place. 
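As an illustration of the feature-selection step of this induction, the following Python fragment is a rough sketch of the gain-ratio computation defined by the formulas above; it is not the authors' implementation, and the pattern representation (a dict with a 'class' label and a mapping from features to sets of values) is assumed for illustration. Set-valued features are handled here by letting a pattern count once in every partition whose value it carries.

# Hedged sketch of gain-ratio-based feature selection (Quinlan, 1986).
# Assumed pattern format:
#   {'class': 'Vrb', 'features': {'Pos_next': {'Art'}, 'Gender_prev': {'Fem', 'Neu'}}}
from math import log2
from collections import Counter, defaultdict

def info(patterns):
    """info(TS): average information needed to identify the class of a pattern."""
    total = len(patterns)
    return -sum((n / total) * log2(n / total)
                for n in Counter(p['class'] for p in patterns).values())

def gain_ratio(patterns, feature):
    """gain(F) / split_info(F) for one feature over a training (sub)set."""
    partitions = defaultdict(list)
    for p in patterns:
        for value in p['features'].get(feature, {None}):   # set-valued: all branches
            partitions[value].append(p)
    total = sum(len(part) for part in partitions.values())
    info_f = sum((len(part) / total) * info(part) for part in partitions.values())
    gain = info(patterns) - info_f
    split = -sum((len(part) / total) * log2(len(part) / total)
                 for part in partitions.values())
    return gain / split if split > 0 else 0.0

def best_feature(patterns, feature_set):
    """F_best: the feature with the highest gain ratio."""
    return max(feature_set, key=lambda f: gain_ratio(patterns, f))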
Every value v_i of the feature F tested at a non-terminal node is accompanied by a pattern subset TS_i (i.e., the subset of patterns containing the value v_i). If two or more values of F are found in a training pattern (set-valued feature), the training pattern is directed to all corresponding branches. The algorithm is initialized with a root node, the entire training set and the entire feature set. The root node contains a dummy feature 6 and a blank class label.</Paragraph> </Section> </Section> <Section position="6" start_page="2638" end_page="2638" type="metho"> <SectionTitle> 6 The dummy feature contains the sole value None. </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2638" end_page="2638" type="sub_section"> <SectionTitle> 4.2 Tree Traversal </SectionTitle> <Paragraph position="0"> Each tree node, as already mentioned, contains a class label that represents the 'decision' made by that node. Moreover, when a node is not a leaf, it also contains an ordered list of values corresponding to the particular feature tested by the node. Each value is the origin of a subtree hanging under the non-terminal node.</Paragraph> <Paragraph position="1"> The tree is traversed from the root to the leaves.</Paragraph> <Paragraph position="2"> Each non-terminal node tests its feature-values, one after the other, against the testing pattern. When a matching value is found, the traversal continues through the subtree hanging under that value. If no value matches, or if the current node is a leaf, the traversal finishes and the node's class label is returned. For the needs of the POS disambiguation/guessing problem, tree nodes contain POS labels and test morphosyntactic features. Figure 3 illustrates the tree-traversal algorithm, via which disambiguation/guessing is performed. The lexical and/or contextual features of an ambiguous/unknown word constitute a testing pattern, which, along with the root of the decision tree corresponding to the specific ambiguity scheme, is passed to the tree-traversal algorithm.</Paragraph> </Section> <Section position="2" start_page="2638" end_page="2638" type="sub_section"> <SectionTitle> 4.3 Subtree Ordering </SectionTitle> <Paragraph position="0"> The tree-traversal algorithm of Figure 3 can be directly implemented by representing the decision tree as nested if-statements (see Appendix B), where each block of code following an if-statement corresponds to a subtree. When an if-statement succeeds, control is transferred to the inner block and, since there is no backtracking, no other feature-values of the same level are tested. To classify a pattern with a set-valued feature, only one value from the set steers the traversal, namely the value that is tested first. A fair policy is to test the most important (most probable) value first or, equivalently, to test first the value that leads to the subtree that gathered more training patterns than its sibling subtrees. 
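To make the traversal of Section 4.2 and this ordering concrete, here is a minimal Python sketch, not the system's actual code; the node layout is assumed: each node stores a default class label, and a non-terminal node additionally stores the feature it tests together with an ordered list of (value, subtree) pairs, with the most probable value first.

# Hedged sketch of decision-tree traversal with ordered feature values.
# The Node layout is an assumption made for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    class_label: str                                  # default decision at this node
    feature: Optional[str] = None                     # feature tested (non-terminals only)
    children: List[Tuple[str, "Node"]] = field(default_factory=list)  # ordered (value, subtree)

def traverse(node: Node, pattern: dict) -> str:
    """Return the POS label decided for `pattern`, a mapping from feature
    names to sets of observed values (set-valued features allowed)."""
    while node.children:                              # stop at a leaf
        observed = pattern.get(node.feature, set())
        for value, subtree in node.children:          # values tested in order
            if value in observed:                     # first matching value steers
                node = subtree
                break
        else:
            break                                     # no value matched: decide here
    return node.class_label

A call such as traverse(root, {'Pos_next': {'Art'}, 'Gender_next': {'Fem'}}) would then return an open-class POS label; the feature names and values in this example are purely illustrative.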
This policy can be incorporated into the tree-traversal algorithm if we first sort the list of feature-values tested by each non-terminal node, according to the algorithm of Figure 4, which is initialized with the root of the tree.</Paragraph> <Paragraph position="1"> This ordering has a nice side effect: it increases the classification speed, as the most probable paths are ranked first in the decision tree.</Paragraph> </Section> <Section position="3" start_page="2638" end_page="2638" type="sub_section"> <SectionTitle> 4.4 Tree Compaction </SectionTitle> <Paragraph position="0"> A tree induced by the algorithm of Figure 2 may contain many redundant paths from the root to the leaves, i.e., paths where, from some node onward, the same decision is made. Tree traversal is sped up by eliminating the tails of the paths that do not alter the decisions taken so far. This compaction does not affect the classification performance of the decision tree. Figure 5 illustrates the tree-compaction algorithm, which is initialized with the root of the tree.</Paragraph> </Section> </Section> </Paper>