<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0710"> <Title>Learning Distributed Linguistic Classes</Title> <Section position="3" start_page="0" end_page="55" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Supervised learning methods applied to natural language classification tasks commonly operate on high-level symbolic representations, with linguistic classes that are usually monadic, without internal structure (Daelemans et al., 1996; Cardie et al., 1999; Roth, 1998). This contrasts with the distributed class encoding commonly found in neural networks (Schmid, 1994). Error-correcting output codes (ECOC) have been introduced to machine learning as a principled and successful approach to distributed class encoding (Dietterich and Bakiri, 1995; Ricci and Aha, 1997; Berger, 1999). With ECOC, monadic classes are replaced by codewords, i.e. binary-valued vectors. An ensemble of separate classifiers (dichotomizers) must be trained to learn the binary subclassifications for every instance in the training set. During classification, the bit predictions of the various dichotomizers are combined to produce a codeword prediction. The class codeword with minimal Hamming distance to the predicted codeword determines the classification of the instance. Codewords are constructed such that their mutual Hamming distance is maximal; the extra bits allow for error recovery, so that the correct class can still be determined even if some bits are predicted wrongly. An error-correcting output code for a k-class problem constitutes a matrix with k rows and 2^(k-1) - 1 columns. Rows are the codewords corresponding to classes, and columns are binary subclassifications or bit functions f_i such that, for an instance e and its class codeword vector c,</Paragraph> <Paragraph position="1"> f_i(e) = c_i(c), </Paragraph> <Paragraph position="2"> where c_i(v) denotes the i-th coordinate of vector v. If the minimum Hamming distance between every pair of codewords is d, then the code has an error-correcting capability of ⌊(d-1)/2⌋ bits. Figure 1 shows the 5 x 15 ECOC matrix for a 5-class problem.</Paragraph> <Paragraph position="3"> In this code, every codeword has a Hamming distance of at least 8 to the other codewords, so this code has an error-correcting capability of 3 bits. ECOC have two natural interpretations. [Figure 1: ECOC for a five-class problem.]</Paragraph>
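The following minimal Python sketch illustrates this train-and-decode cycle. It is an illustration only: it assumes a randomly generated 5 x 15 code rather than the matrix of Figure 1, and uses scikit-learn decision trees as stand-in dichotomizers, since the underlying learners are not specified in this excerpt.

# Minimal ECOC sketch: one dichotomizer per code column, nearest-codeword decoding.
# The random code and the decision-tree dichotomizers are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ecoc(X, y, code):
    # Relabel the data once per column: instance e gets bit f_i(e) = c_i of its class codeword.
    dichotomizers = []
    for i in range(code.shape[1]):
        bits = code[y, i]
        dichotomizers.append(DecisionTreeClassifier().fit(X, bits))
    return dichotomizers

def predict_ecoc(X, dichotomizers, code):
    # Combine the bit predictions into a codeword and pick the class codeword
    # with minimal Hamming distance to it.
    bits = np.column_stack([clf.predict(X) for clf in dichotomizers])
    dists = np.array([[np.sum(b != cw) for cw in code] for b in bits])
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
code = rng.integers(0, 2, size=(5, 15))          # hypothetical 5-class, 15-bit code
y = rng.integers(0, 5, size=200)                 # synthetic 5-class data
X = rng.normal(size=(200, 4)) + y[:, None]       # features loosely tied to the class
models = train_ecoc(X, y, code)
print(predict_ecoc(X[:10], models, code))

A plain one-of-k code offers no such protection: a single erring bit already yields a tie between two codewords, whereas the redundant columns let the decoder out-vote a few erring dichotomizers.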
<Paragraph position="4"> From an information-theoretic perspective, classification with ECOC is like channel coding (Shannon, 1948): the class of a pattern to be classified is a datum sent over a noisy communication channel. The communication channel consists of the trained classifier. The noise consists of the bias (systematic error) and variance (training set-dependent error) of the classifier, which together make up the overall error of the classifier. The received message must be decoded before it can be interpreted as a classification. Adding redundancy to a signal before transmission is a well-known technique in digital communication for recovering from errors due to noise in the channel, and this redundancy is the key to the success of ECOC. From a machine learning perspective, every bit function of an error-correcting output code partitions the instances in the training set into two disjoint subclasses, labeled 0 or 1. This can be interpreted as learning a set of class boundaries. To illustrate this, consider the following binary code for a three-class problem. (This is actually a one-of-c code with no error-correcting capability: the minimal Hamming distance between the codewords is 2, which gives an error-correcting capability of ⌊(2-1)/2⌋ = 0 bits. As such it is the code with the lowest possible error correction, but it serves to illustrate the point.)</Paragraph>
<Paragraph position="5">
     f1  f2  f3
C1:   0   0   1
C2:   0   1   0
C3:   1   0   0
</Paragraph>
<Paragraph position="6"> For every combination of classes (C1-C2, C1-C3, C2-C3), the Hamming distance between the codewords is 2. These horizontal relations have vertical repercussions as well: for every such pair, two bit functions disagree in the classes they select. For C1-C2, f2 selects C2 and f3 selects C1. For C1-C3, f1 selects C3 and f3 selects C1. Finally, for C2-C3, f1 selects C3 and f2 selects C2. So every class is selected two times, and this implies that every class boundary associated with that class in the feature hyperspace is learned twice. In general (Kong and Dietterich, 1995), if the minimal Hamming distance between the codewords of an (error-correcting) code is d, then every class boundary is learned d times. For the one-of-c code above, this implies an error correction of zero: only two votes support each class boundary, and neither vote can be favored in case of a conflict. The decoding of the predicted bit string to a class symbol can thus be interpreted as a form of voting over class boundaries (Kong and Dietterich, 1995), and is able to reduce both the bias and the variance of the classifier.</Paragraph> </Section> </Paper>
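The relation between minimal Hamming distance, boundary votes, and error correction can be checked directly. The short Python sketch below is illustrative only: the three-bit code is the one-of-c code discussed above, while the six-bit code is an arbitrary example (not taken from the paper) with minimal distance 4 and hence an error-correcting capability of ⌊(4-1)/2⌋ = 1 bit.

# Count, for each pair of classes, how many bit functions separate them (boundary votes),
# and check how many erring bits nearest-codeword decoding can recover.
import numpy as np
from itertools import combinations

def min_hamming(code):
    return min(int(np.sum(a != b)) for a, b in combinations(code, 2))

def decode(bits, code):
    return int(np.argmin([np.sum(bits != cw) for cw in code]))

# The one-of-c code for three classes: f1 selects C3, f2 selects C2, f3 selects C1.
one_of_c = np.array([[0, 0, 1],   # C1
                     [0, 1, 0],   # C2
                     [1, 0, 0]])  # C3
for (i, a), (j, b) in combinations(enumerate(one_of_c), 2):
    print(f"C{i+1}-C{j+1}: {int(np.sum(a != b))} dichotomizers vote on this boundary")
print("minimal distance:", min_hamming(one_of_c))   # 2 -> error correction floor((2-1)/2) = 0

# A hypothetical longer code with minimal distance 4 corrects any single erring bit.
longer = np.array([[0, 0, 0, 0, 0, 0],
                   [0, 1, 1, 1, 1, 0],
                   [1, 1, 0, 0, 1, 1]])
print("minimal distance:", min_hamming(longer))     # 4 -> error correction floor((4-1)/2) = 1
noisy = longer[1].copy()
noisy[0] ^= 1                                       # one dichotomizer errs
print("decoded class index:", decode(noisy, longer))  # still 1

With only two votes per boundary, as in the one-of-c code, a single conflicting vote cannot be outvoted, which is exactly the zero error correction noted above.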