<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2186"> <Title>Part of Speech Tagging Using a Network of Linear Separators</Title> <Section position="3" start_page="1136" end_page="1137" type="metho"> <SectionTitle> 2 The SNOW Approach </SectionTitle>
<Paragraph position="0"> The SNOW (Sparse Network Of Linear separators) architecture is a network of threshold gates. Nodes in the first layer of the network represent the input features; target nodes (i.e., the possible output values of the classifier) are represented by nodes in the second layer. Links from the first to the second layer have weights; each target node is thus defined as a (linear) function of the lower-level nodes.</Paragraph>
<Paragraph position="1"> For example, in POS tagging, target nodes correspond to different part-of-speech tags. Each target node can be thought of as an autonomous network, although they all feed from the same input. The network is sparse in that a target node need not be connected to all nodes in the input layer. For example, it is not connected to input nodes (features) that were never active with it in the same sentence, or it may decide, during training, to disconnect itself from some of the irrelevant input nodes if they were not active often enough.</Paragraph>
<Paragraph position="2"> Learning in SNOW proceeds in an on-line fashion. Every example is treated autonomously by each target subnetwork. It is viewed as a positive example by a few of these and as a negative example by the others. In the applications described in this paper, every labeled example is treated as positive by the target node corresponding to its label and as negative by all the others. Thus, every example is used once by all the nodes to refine their definition in terms of the others and is then discarded. At prediction time, given an input sentence s = (x1, x2, ..., xm) (i.e., an activation of a subset of the input nodes), the information propagates through all the competing subnetworks, and the one that produces the highest activity determines the prediction.</Paragraph>
<Paragraph position="3"> A local learning algorithm, Littlestone's Winnow algorithm (Littlestone, 1988), is used at each target node to learn its dependence on other nodes. Winnow has three parameters: a threshold θ and two update parameters, a promotion parameter α > 1 and a demotion parameter β, 0 < β < 1. Let A = {i1, ..., im} be the set of active features that are linked to (a specific) target node.</Paragraph>
<Paragraph position="4"> The algorithm predicts 1 (positive) iff Σ_{i∈A} w_i > θ, where w_i is the weight on the edge connecting the i-th feature to the target node. The algorithm updates its current hypothesis (i.e., its weights) only when a mistake is made. If the algorithm predicts 0 and the received label is 1, the update is (promotion) ∀i ∈ A: w_i ← α · w_i. If the algorithm predicts 1 and the received label is 0, the update is (demotion) ∀i ∈ A: w_i ← β · w_i. For a study of the advantages of Winnow, see (Littlestone, 1988; Kivinen and Warmuth, 1995).</Paragraph>
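<Paragraph> The following Python fragment is a minimal sketch of one such Winnow-trained target node. The class name, parameter values, and the initial weight given to newly linked features are illustrative assumptions, not details taken from the paper.

# Sketch of a single target node trained with Winnow (illustrative only).
class WinnowNode:
    def __init__(self, theta=1.0, alpha=1.5, beta=0.8):
        self.theta = theta    # threshold
        self.alpha = alpha    # promotion parameter, alpha > 1
        self.beta = beta      # demotion parameter, 0 < beta < 1
        self.weights = {}     # sparse: only features linked to this target node

    def activation(self, active_features):
        # Sum the weights of the active features that are linked to this node.
        return sum(self.weights.get(f, 0.0) for f in active_features)

    def predict(self, active_features):
        return 1 if self.activation(active_features) > self.theta else 0

    def update(self, active_features, label):
        # Link newly seen active features with an assumed initial weight of 1.
        for f in active_features:
            self.weights.setdefault(f, 1.0)
        if self.predict(active_features) == label:
            return            # mistake-driven: no change on a correct prediction
        factor = self.alpha if label == 1 else self.beta
        for f in active_features:
            self.weights[f] *= factor

At prediction time, one such node per POS tag is evaluated on the same set of active features, and the tag whose node shows the highest activation is output. </Paragraph>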
</Section> <Section position="4" start_page="1137" end_page="1138" type="metho"> <SectionTitle> 3 The POS Problem </SectionTitle>
<Paragraph position="0"> Part of speech tagging is the problem of identifying the parts of speech of the words in a given text. Since words are ambiguous with respect to their part of speech, the correct part of speech is usually identified from the context in which the word appears. Consider, for example, the sentence The can will rust. Both can and will can accept modal-verb, noun and verb as possible POS tags (and a few more); rust can be tagged both as noun and verb. This leads to many possible POS taggings of the sentence, only one of which (determiner, noun, modal-verb, verb, respectively) is correct. The problem has numerous applications in information retrieval, machine translation and speech recognition, and appears to be an important intermediate stage in many inferences related to natural language understanding.</Paragraph>
<Paragraph position="1"> In recent years, a number of approaches have been tried for solving the problem. The most notable methods are based on Hidden Markov Models (HMM) (Kupiec, 1992; Schütze, 1995), transformation rules (Brill, 1995; Brill, 1997), and multi-layer neural networks (Schmid, 1994).</Paragraph>
<Paragraph position="2"> HMM taggers use manually tagged training data to compute statistics on features. For example, they can estimate lexical probabilities Prob(word|tag) and contextual probabilities Prob(tag|previous n tags). At the testing stage, the taggers conduct a search in the space of POS tags to arrive at the most probable POS labeling with respect to the computed statistics.</Paragraph>
<Paragraph position="3"> That is, given a sentence, the taggers assign to the sentence a sequence of tags that maximizes the product of lexical and contextual probabilities over all words in the sentence.</Paragraph>
<Paragraph position="4"> Transformation-based learning (TBL) (Brill, 1995) is a machine learning approach to rule learning. The learning procedure is a mistake-driven algorithm that produces a set of rules. The hypothesis of TBL is an ordered list of transformations. A transformation is a rule with an antecedent t and a consequent c ∈ C.</Paragraph>
<Paragraph position="5"> The antecedent t is a condition on the input sentence. For example, a condition might be "the preceding word is tagged t". That is, applying the condition to a sentence s defines a feature t(s) ∈ F. Phrased differently, the application of the condition to a given sentence s checks whether the corresponding feature is active in this sentence. The condition holds if and only if the feature is active in the sentence.</Paragraph>
<Paragraph position="6"> The TBL hypothesis is evaluated as follows: given a sentence s, an initial labeling is assigned to it. Then each rule is applied, in order, to the sentence. If the condition of the rule applies, the current label is replaced by the label in the consequent. This process goes on until the last rule in the list is evaluated. The last labeling is the output of the hypothesis.</Paragraph>
<Paragraph position="7"> In its most general setting, the TBL hypothesis is not a classifier (Brill, 1995). The reason is that, in general, the truth value of the condition of the i-th rule may change while evaluating one of the preceding rules. For example, in part of speech tagging, labeling a word with a part of speech changes those conditions on the following words that depend on that part of speech (e.g., "the preceding word is tagged t").</Paragraph>
<Paragraph position="8"> TBL uses a manually tagged corpus for learning the ordered list of transformations. The learning proceeds in stages; at each stage a transformation is chosen that minimizes the number of mislabeled words in the presented corpus. The transformation is then applied, and the process is repeated until no further reduction in the number of mislabeled words can be achieved.</Paragraph>
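<Paragraph> To make the evaluation of a TBL hypothesis concrete, the sketch below applies an ordered list of transformations to an initially labeled sentence. The rule representation (a trigger tag, a replacement tag, and a condition over the current tag sequence) and the example rule are simplified illustrations, not Brill's actual data structures.

# Sketch of applying an ordered list of TBL transformations (illustrative only).
from typing import Callable, List, Tuple

# A transformation: (tag the rule rewrites, replacement tag,
#                    condition on the current tag sequence and position).
Transformation = Tuple[str, str, Callable[[List[str], int], bool]]

def apply_transformations(initial_tags: List[str],
                          rules: List[Transformation]) -> List[str]:
    tags = list(initial_tags)                  # start from the initial labeling
    for from_tag, to_tag, condition in rules:  # apply rules in their learned order
        for i in range(len(tags)):
            # The condition reads the current labeling, so earlier rules can
            # change whether a later rule fires; this is why the hypothesis is
            # not a simple classifier.
            if tags[i] == from_tag and condition(tags, i):
                tags[i] = to_tag
    return tags

# Hypothetical example rule: retag NN as VB when the preceding word is tagged TO.
example_rule = ("NN", "VB", lambda tags, i: i > 0 and tags[i - 1] == "TO")
</Paragraph>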
<Paragraph position="9"> For example, in POS tagging, the consequent of a transformation labels a word with a part of speech. (Brill, 1995) uses a lexicon for the initial annotation of the training corpus, where each word in the lexicon has the set of POS tags seen for that word in the training corpus. Then a search in the space of transformations is conducted to determine the transformation that most reduces the number of wrong tags for the words in the corpus. The application of that transformation to the initially labeled corpus produces another labeling of the corpus with a smaller number of mistakes.</Paragraph>
<Paragraph position="10"> Iterating this procedure leads to learning an ordered list of transformations that can be used as a POS tagger.</Paragraph>
<Paragraph position="11"> There have been attempts to apply neural networks to POS tagging (e.g., (Schmid, 1994)).</Paragraph>
<Paragraph position="12"> That work explored multi-layer network architectures along with the back-propagation algorithm at the training stage. The input nodes of the network usually correspond to the tags of the words surrounding the word being tagged.</Paragraph>
<Paragraph position="13"> The performance of these algorithms is comparable to that of HMM methods.</Paragraph>
<Paragraph position="14"> In this paper, we address the POS problem with no unknown words (the closed-world assumption) from the standpoint of SNOW. That is, we represent a POS tagger as a network of linear separators and use Winnow for learning the weights of the network. The SNOW approach has been successfully applied to other problems of natural language processing (Golding and Roth, 1998; Krymolowski and Roth, 1998; Roth, 1998). However, this problem offers additional challenges to the SNOW architecture and algorithms. First, we are trying to learn a multi-class predictor, where the number of classes is unusually large (about 50) for such learning problems. Second, the hypothesis is evaluated at testing time in the presence of attribute noise. The reason is that the input features of the network are computed with respect to the parts of speech of words, which are initially assigned from a lexicon.</Paragraph>
<Paragraph position="15"> We address the first problem by restricting the set of parts of speech from which a tag for a word is selected. The second problem is alleviated by performing several labeling cycles on the testing corpus.</Paragraph> </Section>
<Section position="5" start_page="1138" end_page="1139" type="metho"> <SectionTitle> 4 The Tagger Network </SectionTitle>
<Paragraph position="0"> The tagger network consists of a collection of linear separators, each of which corresponds to a distinct part of speech. The input nodes of the network correspond to the features. The features are computed for a fixed word in a sentence. We use the following features (a schematic sketch of their computation follows the list): (1) The preceding word is tagged c.</Paragraph>
<Paragraph position="1"> (2) The following word is tagged c.</Paragraph>
<Paragraph position="2"> (3) The word two before is tagged c.</Paragraph>
<Paragraph position="3"> (4) The word two after is tagged c.</Paragraph>
<Paragraph position="4"> (5) The preceding word is tagged c and the following word is tagged t.</Paragraph>
<Paragraph position="5"> (6) The preceding word is tagged c and the word two before is tagged t.</Paragraph>
<Paragraph position="6"> (7) The following word is tagged c and the word two after is tagged t.</Paragraph>
<Paragraph position="7"> (8) The current word is w.</Paragraph>
<Paragraph position="8"> (9) The most probable part of speech for the current word is c. (Features 1-8 are part of the (Brill, 1995) feature set.)</Paragraph>
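<Paragraph> The sketch below shows how the nine feature templates above could be instantiated for the word at position i. The string encoding of features, the handling of sentence boundaries, and the assumption that the lexicon lists a word's tags with the most probable tag first are illustrative choices, not the paper's actual representation.

# Sketch of computing the nine feature templates for position i (illustrative).
from typing import Dict, List

def features_for_position(words: List[str], tags: List[str], i: int,
                          lexicon: Dict[str, List[str]]) -> List[str]:
    feats = []
    n = len(words)
    if i - 1 >= 0:
        feats.append("prev_tag=" + tags[i - 1])                        # (1)
    if i + 1 < n:
        feats.append("next_tag=" + tags[i + 1])                        # (2)
    if i - 2 >= 0:
        feats.append("prev2_tag=" + tags[i - 2])                       # (3)
    if i + 2 < n:
        feats.append("next2_tag=" + tags[i + 2])                       # (4)
    if i - 1 >= 0 and i + 1 < n:
        feats.append("prev+next=" + tags[i - 1] + "_" + tags[i + 1])   # (5)
    if i - 2 >= 0:
        feats.append("prev+prev2=" + tags[i - 1] + "_" + tags[i - 2])  # (6)
    if i + 2 < n:
        feats.append("next+next2=" + tags[i + 1] + "_" + tags[i + 2])  # (7)
    feats.append("word=" + words[i])                                   # (8)
    candidates = lexicon.get(words[i], [])
    if candidates:
        feats.append("most_probable_tag=" + candidates[0])             # (9)
    return feats

Each such feature string corresponds to one input node of the network, and it is active whenever it appears in the example for the current word. </Paragraph>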
<Paragraph position="9"> The most probable part of speech for a word is taken from a lexicon. The lexicon is a list of words with a set of possible POS tags associated with each word. The lexicon can be computed from available labeled corpus data, or it can represent a priori information about words in the language.</Paragraph>
<Paragraph position="10"> Training of the SNOW tagger network proceeds as follows. Each word in a sentence produces an example. Given a sentence, the features are computed with respect to each word, thereby producing a positive example for the part of speech the word is labeled with and negative examples for the other parts of speech. The positive and negative examples are presented to the corresponding subnetworks, which update their weights according to Winnow.</Paragraph>
<Paragraph position="11"> In testing, this process is repeated, producing a test example for each word in the sentence. In this case, however, the POS tags of the neighboring words are not known and, therefore, the majority of the features cannot be evaluated. We discuss later various ways to handle this situation. The default one is to use the baseline tags, i.e., the most common POS for the word in the training lexicon. Clearly this is not accurate, and the classification can be viewed as being done in the presence of attribute noise.</Paragraph>
<Paragraph position="12"> Once an example is produced, it is presented to the networks. Each of the subnetworks is evaluated, and we select the one with the highest level of activation among the separators corresponding to the possible tags for the current word. After every prediction, the tag output by the SNOW tagger for a word is used for labeling that word in the test data. Therefore, the features of the following words will depend on the output tags of the preceding words.</Paragraph> </Section> </Paper>