<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2014">
  <Title>Augmenting a Hidden Markov Model for Phrase-Dependent Word Tagging</Title>
  <Section position="2" start_page="0" end_page="94" type="intro">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The paper describes refinements that are currently being investigated in a model for part-of-speech assignment to words in unrestricted text. The model has the advantage that a pre-tagged training corpus is not required. Words are represented by equivalence classes to reduce the number of parameters required and provide an essentially vocabulary-independent model. State chains are used to model selective higher-order conditioning in the model, which obviates the proliferation of parameters attendant in uniformly higher-order models. The structure of the state chains is based on both an analysis of errors and linguistic knowledge.</Paragraph>
    <Paragraph position="1"> Examples show how word dependency across phrases can be modeled.</Paragraph>
    <Paragraph position="2"> Introduction The determination of part-of-speech categories for words is an important problem in language modeling, because both the syntactic and semantic roles of words depend on their part-of-speech category (henceforth simply termed &amp;quot;category&amp;quot;). Application areas include speech recognition/synthesis and information retrieval. Several workers have addressed the problem of tagging text. Methods have ranged from locally-operating rules (Greene and Rubin, 1971), to statistical methods (Church, 1989; DeRose, 1988; Garside, Leech and Sampson, 1987; Jelinek, 1985) and back-propagation (Benello, Mackie and Anderson, 1989; Nakamura and Shikano, 1989).</Paragraph>
    <Paragraph position="3"> The statistical methods can be described in terms of Markov models. States in a model represent categories {cl...c=} (n is the number of different categories used). In a first order model, Ci and Ci_l are random variables denoting the categories of the words at position i and (i - 1) in a text. The transition probability</Paragraph>
    <Paragraph position="5"> category %. A word at position i is represented by the random variable Wi, which ranges over the vocabulary {w~ ...wv} (v is the number of words in the vocabulary). State-dependent probabilities of the form P(Wi = Wa \] Ci = cz) represent the probability that word Wa is seen, given category c~. For instance, the word &amp;quot;dog&amp;quot; can  be seen in the states noun and verb, and only has a non-zero probability in those states. A word sequence is considered as being generated from an underlying sequence of categories. Of all the possible category sequences from which a given word sequence can be generated, the one which maximizes the probability of the words is used. The Viterbi algorithm (Viterbi, 1967) will find this category sequence. The systems previously mentioned require a pre-tagged training corpus in order to collect word counts or to perform back-propagation. The Brown Corpus (Francis and Kucera, 1982) is a notable example of such a corpus, and is used by many of the systems cited above.</Paragraph>
    <Paragraph position="6"> An alternative approach taken by Jelinek, (Jelinek, 1985) is to view the training problem in terms of a &amp;quot;hidden&amp;quot; Markov model: that is, only the words of the training text are available, their corresponding categories are not known. In this situation, the Baum-Welch algorithm (Baum, 1972) can be used to estimate the model parameters. This has the great advantage of eliminating the pre-tagged corpus. It minimizes the resources required, facilitates experimentation with different word categories, and is easily adapted for use with other languages.</Paragraph>
    <Paragraph position="7"> The work described here also makes use of a hidden Markov model. One aim of the work is to investigate the quality and performance of models with minimal parameter descriptions. In this regard, word equivalence  classes were used (Kupiec, 1989). There it is assumed that the distribution of the use of a word depends on the set of categories it can assume, and words are partitioned accordingly. Thus the words &amp;quot;play&amp;quot; and &amp;quot;touch&amp;quot; are considered to behave identically, as members of the class noun-or-verb, and &amp;quot;clay&amp;quot; and &amp;quot;zinc&amp;quot;are members of the class noun. This partitioning drastically reduces the number of parameters required in the model, and aids reliable estimation using moderate amounts of training data. Equivalence classes {Eqvl ...Eqvm} replace the words {wl...Wv} (m &lt;&lt; v) and P(Eqvi I Ci) replace the parameters P(Wi I Ci). In the 21 category model reported in Kupiec (1989) only 129 equivalence classes were required to cover a 30,000 word dictionary. In fact, the number of equivalence classes is essentially independent of the size of the dictionary, enabling new words to be added without any modification to the model.</Paragraph>
    <Paragraph position="8"> Obviously, a trade-off is involved. For example, &amp;quot;dog&amp;quot; is more likely to be a noun than a verb and &amp;quot;see&amp;quot; is more likely to be a verb than a noun. However they are both members of the equivalence class noun-or- verb, and so are considered to behave identically. It is then local word context (embodied in the transition probabilities) which must aid disambiguation of the word. In practice, word context provides significant constraint, so the trade-off appears to be a remarkably favorable one.</Paragraph>
    <Paragraph position="9"> The Basic Model The development of the model was guided by evaluation against a simple basic model (much of the development of the model was prompted by an analysis of the errors in its hehaviour). The basic model contained states representing the following categories:  Including comparative and superlative Am, is, was, has, have, should, must, can, might, etc.</Paragraph>
    <Paragraph position="10"> Including gerund Including past tense When, what, why, etc.</Paragraph>
    <Paragraph position="11"> Words whose stems could not be found in dictionary.</Paragraph>
    <Paragraph position="12"> Used to tag common symbols in the the Lisp programming language (see below:) &amp;quot;To&amp;quot; acting as an infinitive marker The above states were arranged in a first-order, fully connected network, each state having a transition to every other state, allowing all possible sequences of categories. The training corpus was a collection of electronic mail messages concerning the design of the Common-Lisp programming language - a somewhat less than ideal representation of English. Many Lisp-specific words were not in the vocabulary, and thus tagged as unknown, however the lisp category was nevertheless created for frequently occurring Lisp symbols in an attempt to reduce bias in the estimation. It is interesting to note that the model performs very well, despite such &amp;quot;noisy&amp;quot; training data. The training was sentence-based, and the model was trained using 6,000 sentences from the corpus. Eight iterations of the Baum-Welch algorithm were used.</Paragraph>
    <Paragraph position="13"> The implementation of the hidden Markov model is based on that of Rabiner, Levinson and Sondhi (1983). By exploiting the fact that the matrix of probabilities P(Eqvi I Ci) is sparse, a considerable improvement can be gained over the basic training algorithm in which iterations are made over all states. The initial values of the model parameters are calculated from word occurrence probabilities, such that words are initially 9\] assumed to function equally probably as any of their possible categories. Superlative and comparative adjectives were collapsed into a single adjective category, to economize on the overall number of categories. (If desired, after tagging the finer category can be replaced). In the basic model all punctuation except sentence boundaries was ignored. An interesting observation is worth noting with regard to words that can act both as auxiliary and main verbs. Modal auxiliaries were consistently tagged as auxiliary whereas the tagging for other auxiliaries (e.g. &amp;quot;is .... have&amp;quot; etc.) was more variable. This indicates that modal auxiliaries can be recognized as a natural class via their pattern of usage.</Paragraph>
    <Paragraph position="14"> Extending the Basic Model The basic model was used as a benchmark for successive improvements. The first addition was the correct treatment of all non-words in a text. This includes hyphenation, punctuation, numbers and abbreviations. New categories were added for number, abbreviation, and comma. All other punctuation was collapsed into the single new punctuation category.</Paragraph>
    <Section position="1" start_page="92" end_page="94" type="sub_section">
      <SectionTitle>
Refinement of Basic Categories
</SectionTitle>
      <Paragraph position="0"> The verb states of the basic model were found to be too coarse. For example, many noun/verb ambiguities in front of past participles were incorrectly tagged as verbs. The replacement of the auxiliary category by the following categories greatly improved this:  Common words occur often enough to be estimated reliably. In a ranked list of words in the corpus the most frequent 100 words account for approximately 50% of the total tokens in the corpus, and thus data is available to estimate them reliably. The most frequent 100 words of the corpus were assigned individually in the model, thereby enabling them to have different distributions over their categories. This leaves 50% of the corpus for training all the other equivalence classes.</Paragraph>
      <Paragraph position="1"> Editing the Transition Structure A common error in the basic model was the assignment of the word &amp;quot;to&amp;quot; to the to-infcategory (&amp;quot;to&amp;quot; acting as an infinitive marker) instead of preposition before noun phrases. This is not surprising, because &amp;quot;to&amp;quot; is the only member of the to-inf category, P(Wi = &amp;quot;to&amp;quot; \[ Ci = to-in\]) = 1.0. In contrast, P(Wi = &amp;quot;to&amp;quot; I Ci = preposition) = 0.086, because many other words share the preposition state. Unless transition probabilities are highly constraining, the higher probability paths will tend to go through the to-infstate. This situation may be addressed in several ways, the simplest being to initially assign zero transition probabilities from the to-infstate to states other than verbs and the adverb state.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>