<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2028">
  <Title>Using Lexical Dependency and Ontological Knowledge to Improve a Detailed Syntactic and Semantic Tagger of English</Title>
  <Section position="5" start_page="215" end_page="215" type="metho">
    <SectionTitle>
3 Experimental Data
</SectionTitle>
    <Paragraph position="0"> The primary corpus used for the experiments presented in this paper is the ATR General English Treebank. This consists of 518,080 words (approximately 20 words per sentence, on average) of text annotated with a detailed semantic and syntactic tagset. To understand the nature of the task involved in the experiments presented in this paper, one needs some familiarity with the ATR General English Tagset. For detailed presentations, see (Black et al., 1996b; Black et al., 1996a; Black and Finch, 2001). An aperçu can be gained, however, from Figure 1, which shows two sample sentences from the ATR Treebank (and originally from a Chinese take-out food flier), tagged with respect to the ATR General English Tagset. Each verb, noun, adjective and adverb in the ATR tagset includes a semantic label, chosen from 42 noun/adjective/adverb categories and 29 verb/verbal categories, some overlap existing between these category sets.</Paragraph>
    <Paragraph position="1"> Proper nouns, plus certain adjectives and certain numerical expressions, are further categorized via an additional 35 "proper-noun" categories. These semantic categories are intended for any "Standard-American-English" text, in any domain. Sample categories include: "physical.attribute" (nouns/adjectives/adverbs), "alter" (verbs/verbals), "interpersonal.act" (nouns/adjectives/adverbs/verbs/verbals), "orgname" (proper nouns), and "zipcode" (numericals). They were developed by the ATR grammarian and then proven and refined via day-in-day-out tagging for six months at ATR by two human "treebankers", then via four months of tagset-testing-only work at Lancaster University (UK) by five treebankers, with daily interactions among treebankers, and between the treebankers and the ATR grammarian. The semantic categorization is, of course, in addition to an extensive syntactic classification, involving some 165 basic syntactic tags.</Paragraph>
    <Paragraph position="2"> The test corpus has been designed specifically to cope with the ambiguity of the tagset. It is possible to correctly assign any one of a number of 'allowable' tags to a word in context. For example, the tag of the word battle in the phrase "a legal battle" could be either NN1PROBLEM or NN1INTER-ACT, indicating that the semantics is either a problem, or an inter-personal action. The test corpus consists of 53,367 words sampled from the same domains as, and in approximately the same proportions as, the training data, and labeled with a set of up to 6 allowable tags for each word.</Paragraph>
    <Paragraph position="3"> During testing, only if the predicted tag fails to match any of the allowed tags is it considered an error.</Paragraph>
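The scoring rule described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the evaluation code used in the experiments: the function name `allowable_accuracy` and the tags "X1", "X2" and "X3" are hypothetical, while NN1PROBLEM and NN1INTER-ACT are the two allowable tags for "battle" given in the text.

```python
def allowable_accuracy(predicted, allowed):
    """Fraction of tokens whose predicted tag is among the allowable tags.

    predicted: list of tag strings, one per token.
    allowed:   list of sets of tag strings, one set per token.
    """
    if len(predicted) != len(allowed):
        raise ValueError("predicted and allowed must have the same length")
    if not predicted:
        return 0.0
    # A prediction is an error only if it matches none of the allowed tags.
    correct = sum(1 for tag, ok in zip(predicted, allowed) if tag in ok)
    return correct / len(predicted)


# Toy data for "a legal battle"; X1, X2 and X3 are hypothetical tags.
predicted = ["X1", "X2", "NN1INTER-ACT"]
allowed = [{"X1"}, {"X3"}, {"NN1PROBLEM", "NN1INTER-ACT"}]
score = allowable_accuracy(predicted, allowed)  # two of three tokens correct
```

Because each token carries a set of allowable tags, membership testing replaces the usual equality test against a single gold tag.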
  </Section>
  <Section position="6" start_page="215" end_page="217" type="metho">
    <SectionTitle>
4 Tagging Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="215" end_page="215" type="sub_section">
      <SectionTitle>
4.1 ME Model
</SectionTitle>
      <Paragraph position="0"> Our tagging framework is based on a maximum entropy model of the following form: $p(t \mid c) = \gamma \prod_{k=0}^{K} \alpha_k^{f_k(c,t)} \, p_0$</Paragraph>
      <Paragraph position="2"> where: - $t$ is the tag being predicted; - $c$ is the context of $t$; - $\gamma$ is a normalization coefficient that ensures $\sum_{t=0}^{L} \gamma \prod_{k=0}^{K} \alpha_k^{f_k(c,t)} \, p_0 = 1$; - $K$ is the number of features in the model; - $L$ is the number of tags in our tag set; - $\alpha_k$ is the weight of feature $f_k$; - $f_k$ are feature functions with $f_k \in \{0,1\}$; - $p_0$ is the default tagging model (in our case, the uniform distribution, since all of the information in the model is specified using ME constraints).</Paragraph>
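As a rough sketch of how such a model assigns probabilities, the following Python fragment multiplies the weights of the active features, applies a uniform default model, and normalizes over the tag set. The function name `me_distribution`, the toy feature and its weight of 3.0 are assumptions made for illustration; they are not taken from the paper.

```python
def me_distribution(context, tags, features, weights):
    """Return p(t | c) for every tag under a uniform default model p0.

    features: list of functions f_k(context, tag) returning 0 or 1.
    weights:  list of positive weights alpha_k, one per feature.
    """
    p0 = 1.0 / len(tags)  # uniform default model
    scores = {}
    for tag in tags:
        product = 1.0
        for f_k, alpha_k in zip(features, weights):
            # alpha_k ** f_k is alpha_k when the feature fires, else 1.
            product *= alpha_k ** f_k(context, tag)
        scores[tag] = product * p0
    gamma = 1.0 / sum(scores.values())  # normalization coefficient
    return {tag: gamma * s for tag, s in scores.items()}


def apple_is_food(context, tag):
    # Hypothetical feature: fires when the current word is "apple"
    # and the candidate tag is NN1FOOD.
    return 1 if context["w0"] == "apple" and tag == "NN1FOOD" else 0


dist = me_distribution({"w0": "apple"}, ["NN1FOOD", "NN1PROBLEM"],
                       [apple_is_food], [3.0])
```

With a single active feature of weight 3.0 and two tags, the unnormalized scores are 1.5 and 0.5, so the normalized probabilities are 0.75 and 0.25.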
      <Paragraph position="3"> Our baseline model contains the following feature predicate set:</Paragraph>
      <Paragraph position="5"> where: - $w_n$ is the word at offset $n$ relative to the word whose tag is being predicted; - $t_n$ is the tag at offset $n$; - $pos_n$ is the syntax-only tag at offset $n$ assigned by a syntax-only tagger; - $pref_n(w_0)$ is the first $n$ characters of $w_0$; - $suff_n(w_0)$ is the last $n$ characters of $w_0$. This feature set contains a typical selection of n-gram and basic morphological features. When the tagger is trained and tested on the UPENN treebank (Marcus et al., 1994), its accuracy (excluding the $pos_n$ features) is over 96%, close to the state of the art on this task. (Black et al., 1996b) adopted a two-stage approach to prediction, first predicting syntax, then semantics given the syntax, whereas in (Black et al., 1998) both syntax and semantics were predicted together in one step. In using syntactic tags as features, we take a softer approach to the two-stage process. The tagger has access to accurate syntactic information; however, it is not necessarily constrained to accept this choice of syntax. Rather, it is able to decide both syntax and semantics while taking semantic context into account. In order to find the most probable sequence of tags, we tag in a left-to-right manner using a beam-search algorithm.</Paragraph>
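Left-to-right beam search can be sketched as follows. This is an illustrative outline only: the paper does not state the beam width or the interface of the model, so `beam_width`, the `tag_probs` callback and the toy model `toy_probs` are all assumptions.

```python
import math


def beam_search(words, tag_probs, beam_width=3):
    """Left-to-right beam search over tag sequences.

    tag_probs(words, i, history) must return a dict mapping each candidate
    tag for words[i] to its probability given the tags chosen so far.
    """
    beam = [(0.0, [])]  # (log-probability, tag history)
    for i in range(len(words)):
        candidates = []
        for logp, history in beam:
            for tag, p in tag_probs(words, i, history).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), history + [tag]))
        # Keep only the beam_width most probable partial sequences.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1] if beam else []


def toy_probs(words, i, history):
    # Hypothetical model: prefer "N" after "D", otherwise prefer "D".
    if history and history[-1] == "D":
        return {"N": 0.9, "D": 0.1}
    return {"D": 0.6, "N": 0.4}


best = beam_search(["the", "battle"], toy_probs, beam_width=2)
```

Because earlier tags appear in the history passed to the model, previously predicted tags can serve as features for later positions, which is what the $t_n$ predicates require.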
    </Section>
    <Section position="2" start_page="215" end_page="217" type="sub_section">
      <SectionTitle>
4.2 Feature selection
</SectionTitle>
      <Paragraph position="0"> For reasons of practicability, it is not always possible to use the full set of features in a model: often it is necessary to control the number of features to reduce resource requirements during training. We use mutual information (MI) to select the most useful feature predicates (for more details, see (Rosenfeld, 1996)). It can be viewed as a means of determining how much information a given predicate provides when used to predict an outcome.</Paragraph>
      <Paragraph position="1"> That is, we use the following formula to gauge a feature's usefulness to the model: $I(f; T) = \sum_{f \in \{0,1\}} \sum_{t \in T} p(f,t) \log \frac{p(f,t)}{p(f)\,p(t)}$</Paragraph>
      <Paragraph position="3"> where: - $t \in T$ is a tag in the tagset; - $f \in \{0,1\}$ is the value of any kind of predicate feature.</Paragraph>
      <Paragraph position="4"> Using mutual information is not without its shortcomings. It does not take into account any of the interactions between features. It is possible for a feature to be pronounced useful by this procedure, whereas in fact it is merely giving the same information as another feature but in different form. Nonetheless this technique is invaluable in practice. It is possible to eliminate features  which provide little or no benefit to the model, thus speeding up the training. In some cases it even allows a model to be trained where it would not otherwise be possible to train one. For the purposes of our experiments, we use the top 50,000 predicates for each model to form the feature set.</Paragraph>
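The selection step can be sketched in Python as below. The sketch assumes the standard definition of mutual information between a binary predicate value f and the tag t, namely the sum over all (f, t) pairs of p(f,t) log [p(f,t) / (p(f) p(t))], estimated from raw counts. The function names, the predicate names and the toy observations are hypothetical.

```python
import math
from collections import Counter


def mutual_information(observations):
    """I(f; T) in nats from (f, t) pairs, with f in {0, 1} and t a tag."""
    n = len(observations)
    joint = Counter(observations)
    f_counts = Counter(f for f, _ in observations)
    t_counts = Counter(t for _, t in observations)
    mi = 0.0
    for (f, t), count in joint.items():
        # count * n / (count_f * count_t) equals p(f,t) / (p(f) p(t)).
        ratio = count * n / (f_counts[f] * t_counts[t])
        mi += (count / n) * math.log(ratio)
    return mi


def select_top(predicate_observations, k):
    """Rank predicates by mutual information and keep the k best names."""
    ranked = sorted(
        predicate_observations,
        key=lambda name: mutual_information(predicate_observations[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: one predicate determines the tag, the other is independent of it.
toy = {
    "informative": [(1, "A"), (1, "A"), (0, "B"), (0, "B")],
    "uninformative": [(1, "A"), (0, "A"), (1, "B"), (0, "B")],
}
top = select_top(toy, 1)
```

Each predicate is scored in isolation, which is why two predicates carrying the same information can both be ranked highly, as noted in the text.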
    </Section>
  </Section>
  <Section position="7" start_page="217" end_page="219" type="metho">
    <SectionTitle>
5 External Knowledge Sources
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="217" end_page="217" type="sub_section">
      <SectionTitle>
5.1 Lexical Dependencies
</SectionTitle>
      <Paragraph position="0"> Features derived from n-grams of words and tags in the immediate vicinity of the word being tagged have underpinned the world of POS tagging for many years (Kupiec, 1992; Merialdo, 1994; Ratnaparkhi, 1996), and have proven to be useful features in WSD (Yarowsky, 1993). Lower-order n-grams which are closer to the word being tagged offer the greatest predictive power (Black et al., 1998). However, in the field of WSD, relational information extracted from grammatical analysis of the sentence has been employed to good effect, and in particular, subject-object relationships between verbs and nouns have been shown to be effective in disambiguating semantics (Nancy and Jean, 1998). We take the broader view that dependency relationships in general between any classes of words may help, and use the ME training process to weed out the irrelevant relationships. The principle is exactly the same as when using a word in the local context as a feature, except that the word in this case has a grammatical relationship with the word being tagged, and can be outside the local neighborhood of the word being tagged. For both types of dependency, we encoded the model constraints $f_{stl}(d)$ as boolean functions of the form: $f_{stl}(d) = 1$ if $d.s = s$, $d.t = t$ and $d.l = l$, and $f_{stl}(d) = 0$ otherwise</Paragraph>
      <Paragraph position="2"> where: - $d$ is a lexical dependency, consisting of a source word (the word being tagged) $d.s$, a target word $d.t$ and a label $d.l$; - $s$ and $t$ (words), and $l$ (link label) are specific to the feature. We generated two distinct features for each dependency. The source and target were exchanged to create these features. This was to allow the models to capture the bidirectional nature of the dependencies. For example, when tagging a verb, the model should be aware of the dependent object, and conversely when tagging that object, the model should have a feature imposing a constraint arising from the identity of the dependent verb. We parsed our corpus using the parser detailed in (Grinberg et al., 1995). The dependencies output by this parser are labeled with the type of dependency (connector) involved. For example, subjects (connector type S) and direct objects of verbs (O) are explicitly marked by the process (a full list of connectors is provided in the paper). We used all of the dependencies output by the parser as features in the models.</Paragraph>
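A dependency constraint of this kind, and the exchange of source and target, might look as follows in Python. The names `Dependency`, `make_feature` and `both_directions` are hypothetical, and the sketch assumes that the link label is left unchanged when the two words are exchanged, which the text does not specify.

```python
from collections import namedtuple

# d.s is the word being tagged, d.t the related word, d.l the link label.
Dependency = namedtuple("Dependency", ["s", "t", "l"])


def make_feature(s, t, l):
    """Build the boolean constraint for one (source, target, label) triple."""
    def f_stl(d):
        return 1 if (d.s, d.t, d.l) == (s, t, l) else 0
    return f_stl


def both_directions(head, dependent, label):
    """Return the two dependencies obtained by exchanging source and target."""
    return [Dependency(head, dependent, label),
            Dependency(dependent, head, label)]


# Toy example: a verb and its direct object (connector type O).
deps = both_directions("eat", "apple", "O")
object_sees_verb = make_feature("apple", "eat", "O")
fired = [object_sees_verb(d) for d in deps]  # fires only for the second one
```

The first dependency is the one seen when tagging the verb and the second when tagging its object, so each word has a feature that reflects the identity of the other.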
      <Paragraph position="3">  It is possible to extract lexical dependencies from a phrase-structure parse. The procedure is explained in detail in (Collins, 1996). In essence, each non-terminal node in the parse tree is assigned a head word, which is the head of one of its children denoted the 'head child'. Dependencies are established between this headword and the heads of each of the children (except for the head child). In these experiments we used the</Paragraph>
      <Paragraph position="5"> trees to the corpus. The parser had a 98.9% coverage of the sentences in our corpora. Again, all of the dependencies output by the parser were used as features in the models.</Paragraph>
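The head-based extraction of dependencies from a phrase-structure parse can be sketched as below. The tree encoding is an assumption made for the example: a leaf is a plain string, and a non-terminal is a tuple (label, index of the head child, list of children). In practice the head child is chosen by head-percolation rules, which this sketch takes as given.

```python
def head_word(node):
    """Head word of a node; a leaf is a plain string and is its own head."""
    if isinstance(node, str):
        return node
    _label, head_index, children = node
    return head_word(children[head_index])


def extract_dependencies(node):
    """Collect (head, dependent) pairs from a head-annotated parse tree."""
    if isinstance(node, str):
        return []
    _label, head_index, children = node
    head = head_word(node)
    pairs = []
    for i, child in enumerate(children):
        if i != head_index:
            # Every non-head child depends on the head of its parent.
            pairs.append((head, head_word(child)))
        pairs.extend(extract_dependencies(child))
    return pairs


# Toy tree for "the dog ate an apple" (hypothetical head annotations).
tree = ("S", 1, [("NP", 1, ["the", "dog"]),
                 ("VP", 0, ["ate", ("NP", 1, ["an", "apple"])])])
pairs = extract_dependencies(tree)
```

For the toy tree the verb "ate" heads the sentence, so both noun heads attach to it and each determiner attaches to its noun.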
    </Section>
    <Section position="2" start_page="217" end_page="219" type="sub_section">
      <SectionTitle>
5.2 Hierarchical Word Ontologies
</SectionTitle>
      <Paragraph position="0"> In this section we consider the effect of features derived from hierarchical sets of words. The primary advantage is that we are able to construct these hierarchies using knowledge from outside the training corpus of the tagger itself, and thereby glean knowledge about rare words. In these experiments we use the human-annotated word taxonomy of hypernyms (IS-A relations) in the WordNet database, and an automatically acquired ontology made by clustering words in a large corpus of unannotated text.</Paragraph>
      <Paragraph position="1"> We have chosen to use hierarchical schemes for both the automatic and manually acquired ontologies because this offers the opportunity to combat data-sparseness issues by allowing features derived from all levels of the hierarchy to be used.</Paragraph>
      <Paragraph position="2"> The process of training the model is able to decide the levels of granularity that are most useful for disambiguation. For the purposes of generating features for the ME tagger we treat both types of hierarchy in the same fashion. One of these features is illustrated in Figure 5.3. Each predicate is effectively a question which asks whether the word (or word being used in a particular sense in the case of the WordNet hierarchy) is a descendant of the node to which the predicate applies. These predicates become more and more general as one moves up the hierarchy. For example, in the hierarchy shown in Figure 5.2, looking at the nodes on the right-hand branch, the lowest node represents the class of apple trees whereas the top node represents the class of all plants.</Paragraph>
      <Paragraph position="3"> We expect these hierarchies to be particularly useful when tagging out-of-vocabulary words (OOVs). The identity of the word being tagged is by far the most important feature in our baseline model. When tagging an OOV this information is not available to the tagger. The automatic clustering has been trained on 100 times as much data as our tagger, and therefore will have information about words that the tagger has not seen during training. To illustrate this point, suppose that we are tagging the OOV pomegranate. This word is in the WordNet database, and is in the same synset as the 'fruit' sense of the word apple. It is reasonable to assume that the model will have learned (from the many examples of all fruit words) that the predicate representing membership of this fruit synset should, if true, favor the selection of the correct tag for fruit words: NN1FOOD. The predicate will be true for the word pomegranate, which will thereby benefit from the model's knowledge of how to tag the other words in its class. Even if this is not so at this level in the hierarchy, it is likely to be so at some level of granularity. Precisely which levels of detail are useful will be learned by the model during training.</Paragraph>
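The way hierarchy predicates let an unseen word share evidence with seen words can be illustrated with a small Python sketch. The hierarchy below is a hand-made toy, not the actual WordNet taxonomy or the clustering output, and the names `ancestor_predicates`, `parent_of` and `lexicon` are hypothetical.

```python
def ancestor_predicates(word, parent_of, lexicon):
    """Return the set of hierarchy nodes of which `word` is a descendant.

    lexicon:   maps a word to the leaf nodes (senses or clusters) it belongs to.
    parent_of: maps each node to its parent; the root has no entry.
    """
    active = set()
    for node in lexicon.get(word, []):
        # Walk up to the root, switching on one predicate per level.
        while node is not None and node not in active:
            active.add(node)
            node = parent_of.get(node)
    return active


# Toy hierarchy with two branches, one for each sense of "apple".
parent_of = {
    "edible_fruit": "produce", "produce": "food",
    "apple_tree": "fruit_tree", "fruit_tree": "tree", "tree": "plant",
}
lexicon = {
    "apple": ["edible_fruit", "apple_tree"],
    "pomegranate": ["edible_fruit"],
}
seen = ancestor_predicates("apple", parent_of, lexicon)
unseen = ancestor_predicates("pomegranate", parent_of, lexicon)
shared = seen.intersection(unseen)
```

Every predicate that is active for the unseen word "pomegranate" is also active for "apple", so weights learned from the seen word at any level of the hierarchy carry over.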
      <Paragraph position="4"> We used the automatic agglomerative mutual-information-based clustering method of (Ushioda, 1996) to form hierarchical clusters from approximately 50 million words of tokenized, unannotated text drawn from domains similar to those of the treebank used to train the tagger. Figure 5.2 shows the position of the word apple within the hierarchy of clusters. This example highlights both the strengths and weaknesses of this approach. One strength is that the process of clustering proceeds in a purely objective fashion and associations between words that may not have been considered by a human annotator are present. Moreover, the clustering process considers all types that actually occur in the corpus, and not just those words that might appear in a dictionary (we will return to this later). A major problem with this approach is that the clusters tend to contain a lot of noise. Rare words can easily find themselves members of clusters to which they do not seem to belong, by virtue of the fact that there are too few examples of the word to allow the clustering to work well for these words. This problem can be mitigated somewhat by simply increasing the size of the text that is clustered. However, the clustering process is computationally expensive. Another problem is that a word may only be a member of a single cluster; thus typically the cluster set assigned to a word will only be appropriate for that word when used in its most common sense.</Paragraph>
      <Paragraph position="5"> Approximately 93% of running words in the test corpus, and 95% in the training corpus, were covered by the words in the clusters (when restricted to verbs, nouns, adjectives and adverbs, these figures were 94.5% and 95.2% respectively). Approximately 81% of the words in the vocabulary from the test corpus were covered, and 71% of the training corpus vocabulary was covered.</Paragraph>
      <Paragraph position="6">  For this class of features, we used the hypernym taxonomy of WordNet (Fellbaum, 1998). Figure 5.2 shows the WordNet hypernym taxonomy for the two senses of the word apple that are in the database. The set of predicates query membership of all levels of the taxonomy for all WordNet senses of the word being tagged. An example of one such predicate is shown in the figure.</Paragraph>
      <Paragraph position="7"> Only 63% of running words in both the training and the test corpus were covered by WordNet. Although this figure appears low, it can be explained by the fact that WordNet only contains entries for words that have senses in certain parts of speech. Some very frequent classes of words, for example determiners, are not in WordNet. The coverage of only nouns, verbs, adjectives and adverbs in running text is 94.5% for both training and test sets. Moreover, approximately 84% of the words in the vocabulary from the test corpus were covered, and 79% on the training corpus. Thus, the effective coverage of WordNet on the important classes of words is similar to that of the automatic clustering method.</Paragraph>
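The two kinds of coverage figure quoted above, over running words and over vocabulary, can be computed as in the following sketch. The function name `coverage` and the toy token list are assumptions for illustration; the real figures come from the corpora described in the paper.

```python
def coverage(tokens, known_words):
    """Return (running-word coverage, vocabulary coverage) as fractions."""
    if not tokens:
        return 0.0, 0.0
    vocabulary = set(tokens)
    covered_tokens = sum(1 for w in tokens if w in known_words)
    covered_types = sum(1 for w in vocabulary if w in known_words)
    return covered_tokens / len(tokens), covered_types / len(vocabulary)


# Toy data: the determiner is frequent but absent from the resource.
tokens = ["the", "apple", "the", "pomegranate"]
known_words = {"apple", "pomegranate"}
token_cov, type_cov = coverage(tokens, known_words)
```

In the toy data the uncovered determiner accounts for half of the running words but only one third of the vocabulary, which mirrors the explanation given for the low running-word coverage of WordNet.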
    </Section>
  </Section>
</Paper>