<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2402">
  <Title>Semantic Lexicon Construction: Learning from Unlabeled Data via Spectral Analysis</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word Classification Problem
</SectionTitle>
    <Paragraph position="0"> The problem is to classify words (as lexical items) into the entity classes that are most likely referred to by their occurrences, where the notion of 'most likely' is with respect to the domain of the text2.</Paragraph>
    <Paragraph position="1"> More formally, consider all the possible instances of word occurrences (including their context) in the world, which we call set a0 , and assume that each word occurrence in a0 refers to one of the entity classes in set a1 (e.g.,</Paragraph>
    <Paragraph position="3"> that observed word occurrences (i.e., corpora) are independently drawn from a0 according to some probability distribution a4 . An example of a4 might be the distribution observed in all the newspaper articles in 1980's, or the distribution observed in biomedical articles. That is, a4 represents the assumed domain of text.</Paragraph>
    <Paragraph position="4"> We define a5a7a6a9a8a11a10a13a12 to be the entity class most likely referred to by word a10 's occurrences in the assumed domain of text, i.e., a5a14a6a15a8a11a10a13a12 a2 a16a18a17a20a19a9a21a22a16a24a23a7a25a27a26a29a28a31a30 a8a11a32 refers to a33a35a34a36a32 is an occurrence of a10a13a12a38a37 given that a32 is arbitrarily drawn from a0 according to a4 . Then, our word classification problem is to predict a5a7a6 -labels of all the words (as lexical items) in a given word set a39 , when the following resources are available: a40 An unannotated corpus of the domain of interest which we regard as unlabeled word occurrences arbitrarily drawn from a0 according to a4 . We assume that all the words in a39 appear in this corpus.</Paragraph>
    <Paragraph position="5"> a40 Feature extractors. We assume that some feature extractors a41 are available, which we can apply to word occurrences in the above unannotated corpus.</Paragraph>
    <Paragraph position="6"> Feature a41a38a8a11a32a42a12 might be, for instance, the set of head nouns that participate in list construction with the focus word of a32 .</Paragraph>
    <Paragraph position="7"> 2E.g., &amp;quot;plant&amp;quot; might be most likely to be a living thing if it occurred in gardening books, but it might be most likely to be a facility in newspaper articles.</Paragraph>
    <Paragraph position="8"> a40 Seed words and their a5a14a6 labels. We assume that the a5a7a6 -labels of several words in a39 are revealed as labeled examples.</Paragraph>
    <Paragraph position="9"> Note that in this task configuration, test data is known at the time of training (as in the transductive setting). Although we do not pursue transductive learning techniques (e.g., Vapnik (1998)) in this work, we will set up the experimental framework accordingly.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Using Vector Similarity
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Error Factors
</SectionTitle>
      <Paragraph position="0"> Consider a straightforward feature vector representation using normalized joint counts of features and the word, which we call count vector a43a44a46a45 . More formally, the a47 -th element of a43</Paragraph>
      <Paragraph position="2"> the count of events observed in the unannotated corpus.</Paragraph>
      <Paragraph position="3"> One way to classify words would be to compare count vectors for seeds and words and to choose the most similar seeds, using inner products as the similarity measure.</Paragraph>
      <Paragraph position="4"> Let us investigate the factors that may affect the performance of such inner product-based label prediction. Let  a54 a25 (for class a33 ) be the vectors of feature occurrence probabilities, so that their a47 -th elements are a30 a8a49a48a24a50a20a34a10a13a12</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 and a30
</SectionTitle>
    <Paragraph position="0"> a8a49a48a24a50a20a34a33a55a12 , respectively. Now we set vec-</Paragraph>
    <Paragraph position="2"> a45 is a vector of the difference between true (but unknown) feature occurrence probabilities and their maximum likelihood estimations. We call</Paragraph>
    <Paragraph position="4"> ror.</Paragraph>
    <Paragraph position="5"> If occurrences of word a10 and features are conditionally independent given labels, then a43 a57 a45 is zero4. Therefore, we call a43 a57 a45 , dependency. It would be ideal (even if unrealistic) if the dependency were zero so that features convey class information rather than information specific to a10 . Now consider the conditions under which a word pair with the same label has a larger inner product than the pair with different labels. It is easy to show that, with feature extractors fixed to reasonable ones, smaller estimation errors and smaller dependency ensure better performance of label prediction, in terms of lower-bound analysis. More precise descriptions are found in the Appendix. 3a64a9a65a61a66a68a67a27a69a70a72a71 denotes the probability that feature a66a55a67 is in</Paragraph>
    <Paragraph position="7"> given that a74 is an occurrence of word a70 , where a74 is randomly drawn from a76 according to a77 .</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Spectral analysis for classifying words
</SectionTitle>
      <Paragraph position="0"> We seek to remove the above harmful portions a43</Paragraph>
      <Paragraph position="2"> a45 from count vectors -- which correspond to estimation error and feature dependency -- by employing spectral analysis and succeeding subspace projection.</Paragraph>
      <Paragraph position="3"> Background A brief review of spectral analysis is found in the Appendix. Ando and Lee (2001) analyze the conditions under which the application of spectral analysis to a term-document matrix (as in LSI) approximates an optimum subspace. The notion of 'optimum' is with respect to the accuracy of topic-based document similarities. The proofs rely on the mathematical findings known as the invariant subspace perturbation theorems proved by Davis and Kahan (1970).</Paragraph>
      <Paragraph position="4"> Approximation of the span of  a54 a25 's By adapting Ando and Lee's analysis to our problem, it can be shown that spectral analysis will approximate the span of  a54 a25 's, essentially, null a40 if the count vectors (chosen as input to spectral analysis) well-represent all the classes, and a40 if these input vectors have sufficiently small estimation errors and dependency.</Paragraph>
      <Paragraph position="5"> This is because, intuitively,  a54 a25 's are the most prominently observed sub-vectors among the input vectors in that case. (Recall that the essence of spectral analysis is to capture the most prominently observed vector directions into a subspace.) Then, the error portions can be mostly removed from any count vectors by orthogonally projecting the vectors onto the subspace, assuming error portions are mostly orthogonal to the span of  a54 a25 's.</Paragraph>
      <Paragraph position="6"> Choice of count vectors As indicated by the above two conditions, the choice of input vectors is important when applying spectral analysis. The tightness of subspace approximation depends on the degree to which those conditions are met. In fact, it is easy to choose vectors with small estimation errors so that the second condition is likely to be met. Vectors for high frequency words are expected to have small estimation errors. Hence, we propose the following procedure.</Paragraph>
      <Paragraph position="7">  1. From the unlabeled word set a39 , choose the a0 most frequent words. a0 is a sufficiently large constant. Frequency is counted in the given unannotated corpus. null 2. Generate count vectors for all the a0 words by applying a feature extractor to word occurrences in the given unannotated corpus.</Paragraph>
      <Paragraph position="8"> 3. Compute the a1 -dimensional subspace by applying  spectral analysis to the a0 count vectors generated in Step 2 5.</Paragraph>
      <Paragraph position="9"> 4. Generate count vectors (as in Step 2) for all the words (including seeds) in a39 . Generate new feature vectors by orthogonally projecting them onto the subspace6.</Paragraph>
      <Paragraph position="10"> When we have multiple feature extractors, we perform the above procedure independently for each of the feature extractors, and concatenate the vectors in the end. Hereafter, we call this procedure and the vectors obtained in this manner Spectral and spectral vectors, respectively. Spectral vectors serve as feature vectors for a linear classifier for classifying words.</Paragraph>
      <Paragraph position="11"> Note that we do not claim that the above conditions for subspace approximation are always satisfied. Rather, we consider them as insight into spectral analysis on this task, and design the method so that the conditions are likely to be met.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 The number of input vectors and the subspace dimensionality
</SectionTitle>
      <Paragraph position="0"> There are two parameters: $n$, the number of count vectors used as input to spectral analysis, and $k$, the dimensionality of the subspace.</Paragraph>
      <Paragraph position="1"> a0 should be sufficiently large so that all the classes are represented by the chosen vectors. However, an excessively large a0 would result in including low frequency words, which might degrade the subspace approximation. In principle, the dimensionality of the subspace a1 should be set to the number of classes a34a1a22a34, since we seek to approximate the span of</Paragraph>
      <Paragraph position="3"> ever, for the typical practice of semantic lexicon construction, a1 should be greater than a34a1a22a34 because at least one class tends to have very broad coverage - 'Others' as in a0 Person, Organization, Others a1 . It is reasonable to assume that features correlate to its (unknown) inherent subclasses rather than to such a broadly defined class itself. The dimensionality a1 should take account of the number of such subclasses.</Paragraph>
      <Paragraph position="4"> In practice, a0 and a1 need be determined empirically.</Paragraph>
      <Paragraph position="5"> We will return to this issue in Section 5.2.</Paragraph>
      <Paragraph position="6">  normalized count vectors. We compute left singular vectors of this matrix corresponding to the a5 largest singular values. The computed left singular vectors are the basis vectors of the desired subspace.</Paragraph>
      <Paragraph position="7">  vector computed in the previous step. Alternatively, one can generate the vector whosea19 -th entry is a11 a13a67 a14a16a18a17 , as it produces the same inner products, due to the orthonormality of left singular vectors.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Related Work and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Spectral analysis for word similarity
measurement
</SectionTitle>
      <Paragraph position="0"> Spectral analysis has been used in traditional factor analysis techniques (such as Principal Component Analysis) to summarize high-dimensional data. LSI uses spectral analysis for measuring document or word similarities.</Paragraph>
      <Paragraph position="1"> From our perspective, the LSI word similarity measurement is similar to the special case where we have a single feature extractor that returns the document membership of word occurrence a32 .</Paragraph>
      <Paragraph position="2"> Among numerous empirical studies of LSI, Landauer and Dumais (1997) report that using the LSI word similarity measure, 64.4% of the synonym section of TOEFL (multi-choice) were answered correctly, which rivals college students from non-English speaking countries. We conjecture that if more effective feature extractors were used, performance might be better.</Paragraph>
      <Paragraph position="3"> Sch&amp;quot;uetze (1992)'s word sense disambiguation method uses spectral analysis for vector dimensionality reduction. He reports that use of spectral analysis does not affect the task performance, either positively or negatively.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Bootstrapping methods for constructing semantic lexicons
</SectionTitle>
      <Paragraph position="0"> A common trend for the semantic lexicon construction task is that of bootstrapping, exploiting strong syntactic cues -- such as a bootstrapping method that iteratively grows seeds by using co-occurrences in lists, conjunctions, and appositives (Roark and Charniak, 1998); meta-bootstrapping, which repeatedly finds extraction patterns and extracts words from the found patterns (Riloff and Jones, 1999); and a co-training combination of three bootstrapping processes, each of which exploits one of appositives, compound nouns, and ISA-clauses (Phillips and Riloff, 2002). Thelen and Riloff (2002)'s bootstrapping method iteratively performs feature selection and word selection for each class. It outperformed the best-performing bootstrapping method for this task at the time. We also note that there are a number of bootstrapping methods successfully applied to text -- e.g., word sense disambiguation (Yarowsky, 1995), named entity instance classification (Collins and Singer, 1999), and the extraction of 'part' words given the 'whole' word (Berland and Charniak, 1999).</Paragraph>
      <Paragraph position="1"> In Section 5, we report experiments using syntactic features shown to be useful by the above studies, and compare performance with Thelen and Riloff (2002)'s bootstrapping method.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Techniques for learning from unlabeled data
</SectionTitle>
      <Paragraph position="0"> While most of the above bootstrapping methods are targeted to NLP tasks, techniques such as EM and co-training are generally applicable when equipped with appropriate models or classifiers. We will present high-level and empirical comparisons (Sections 4.4 and 5, respectively) of Spectral with representative techniques for learning from unlabeled data, described below.</Paragraph>
      <Paragraph position="1"> Expectation Maximization (EM) is an iterative algorithm for model parameter estimation (Dempster et al., 1977). Starting from some initial model parameters, the E-step estimates the expectation of the hidden class variables. Then, the M-step recomputes the model parameters so that the likelihood is maximized, and the process repeats. EM is guaranteed to converge to some local maximum. It is very popular and useful, but also known to be sensitive to the initialization of parameters.</Paragraph>
      <Paragraph position="2"> The co-training paradigm proposed by Blum and Mitchell (1998) involves two classifiers employing two distinct views of the feature space, e.g., 'textual content' and 'hyperlink' of web documents. The two classifiers are first trained with labeled data. Each of the classifiers adds to the labeled data pool the examples whose labels are predicted with the highest confidence.</Paragraph>
      <Paragraph position="3"> The classifiers are trained with the new augmented labeled data, and the process repeats. Its theoretical foundations are based on the assumptions that two views are redundantly sufficient and conditionally independent given classes. Abney (2002) presents an analysis to relax the (fairly strong) conditional independence assumption to weak rule dependence.</Paragraph>
      <Paragraph position="4"> Nigam and Ghani (2000) study the effectiveness of co-training through experiments on the text categorization task. Pierce and Cardie (2001) investigate the scalability of co-training on the base noun phrase bracketing task, which typically requires a larger number of labeled examples than text categorization. They propose to manually correct labels to counteract the degradation of automatically assigned labels on large data sets. We use these two empirical studies as references for the implementation of co-training in our experiments.</Paragraph>
      <Paragraph position="5"> Co-EM (Nigam and Ghani, 2000) combines the essence of co-training and EM in an elegant way. Classifier A is initially trained with the labeled data, and computes probabilistically-weighted labels for all the unlabeled data (as in E-step). Then classifier B is trained with the labeled data plus the probabilistic labels computed by classifier A. It computes probabilistic labels for A, and the process repeats. Co-EM differs from co-training in that all the unlabeled data points are re-assigned probabilistic labels in every iteration. In Nigam and Ghani (2000)'s experiments, co-EM outperformed EM, and rivaled co-training. Based on the results, they argued for the benefit of exploiting distinct views.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> We observe two major differences between spectral analysis and the above techniques for learning from unlabeled data.</Paragraph>
      <Paragraph position="1"> Feature prediction (Spectral) vs. label prediction First, the learning processes of the above techniques are driven by the prediction of class labels on the unlabeled data. As their iterations proceed, for instance, the estimations of class-related probabilities such as a30 a8a61a33a55a12 , a30 a8a49a48a36a34a33a55a12 may be improved. On the other hand, a spectral vector can be regarded as an approximation of</Paragraph>
      <Paragraph position="3"> In that sense, spectral analysis predicts unseen feature occurrences which might be observed with word a10 if a10 had more occurrences in the corpus.</Paragraph>
      <Paragraph position="4"> Global optimization (Spectral) vs. local optimization Secondly, starting from the status initialized by labeled data, EM performs local maximization, and co-training and other bootstrapping methods proceed greedily. Consequently, they are sensitive to the given labeled data. In contrast, spectral analysis performs global optimization (eigenvector computation) independently from the labeled data. Whether or not the performed global optimization is meaningful for classification depends on the 'usefulness' of the given feature extractors. We say features are useful if dependency and feature mingling (defined in the Appendix) are small.</Paragraph>
      <Paragraph position="5"> It is interesting to see how these differences affect the performance on the word classification task. We will report experimental results in the next section.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>