<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2402">
  <Title>Semantic Lexicon Construction: Learning from Unlabeled Data via Spectral Analysis</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Entity detection plays an important role in information extraction systems. Whether entity recognizers employ machine learning techniques or rule-based approaches, it is useful to have a gazetteer of words1 that reliably suggest target entity class membership. This paper considers the task of generating such gazetteers from a large unannotated corpus with minimal manual effort. Starting from a small number of labeled examples (seeds), e.g., a0 &amp;quot;car&amp;quot;, &amp;quot;plane&amp;quot;, &amp;quot;ship&amp;quot; a1 labeled as vehicles, we seek to automatically collect more of these.</Paragraph>
    <Paragraph position="1"> This task is sometimes called the semi-automatic construction of semantic lexicons, e.g. (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Thelen and Riloff, 2002; Phillips and Riloff, 2002). A common trend in prior studies is bootstrapping, which is an iterative process to collect new words and regard the words newly collected with high confidence as additional labeled examples for the next iteration. The aim of bootstrapping is to compensate for the paucity of labeled examples.</Paragraph>
    <Paragraph position="2"> However, its potential danger is label 'contamination' -namely, wrongly (automatically) labeled examples may 1Our argument in this paper holds for relatively small linguistic objects including words, phrases, collocations, and so forth. For simplicity, we refer to words.</Paragraph>
    <Paragraph position="3"> misdirect the succeeding iterations. Also, low frequency words are known to be problematic. They do not provide sufficient corpus statistics (e.g., how frequently the word occurs as the subject of &amp;quot;said&amp;quot;), for adequate label prediction.</Paragraph>
    <Paragraph position="4"> By contrast, we focus on improving feature vector representation for use in standard linear classifiers. To counteract data sparseness, we employ subspace projection where subspaces are derived by singular value decomposition (SVD). In this paper, we generally call such SVD-based subspace construction spectral analysis.</Paragraph>
    <Paragraph position="5"> Latent Semantic Indexing (LSI) (Deerwester et al., 1990) is a well-known application of spectral analysis to word-by-document matrices. Formal analyses of LSI were published relatively recently, e.g., (Papadimitriou et al., 2000; Azar et al., 2001). Ando and Lee (2001) show the factors that may affect LSI's performance by analyzing the conditions under which the LSI subspace approximates an optimum subspace. Our theoretical basis is partly derived from this analysis. In particular, we replace the abstract notion of 'optimum subspace' with a precise definition of a subspace useful for our task.</Paragraph>
    <Paragraph position="6"> The essence of spectral analysis is to capture the most prominently observed vector directions (or sub-vectors) into a subspace. Hence, we should apply spectral analysis only to 'good' feature vectors so that useful portions are captured into the subspace, and then factor out 'harmful' portions of all the vectors via subspace projection. We first formalize the notion of harmful portions of the commonly used feature vector representation. Experimental results show that this new strategy significantly improves label prediction performance. For instance, when trained with 300 labeled examples and unlabeled data, the proposed method rivals Naive Bayes classifiers trained with 7500 labeled examples.</Paragraph>
    <Paragraph position="7"> In general, generation of labeled training data involves expensive manual effort, while unlabeled data can be easily obtained in large amounts. This fact has motivated supervised learning with unlabeled data, such as co-training (e.g., Blum and Mitchell (1998)). The method we propose (called Spectral) can also be regarded as exploiting unlabeled data for supervised learning. The main difference from co-training or popular EM-based approaches is that the process of learning from unlabeled data (via spectral analysis) does not use any class information. It encodes learned information into feature vectors - which essentially serves as prediction of unseen feature occurrences - for use in supervised classification. The absence of class information during the learning process may seem to be disadvantageous. On the contrary, our experiments show that Spectral consistently outperforms all the tested methods that employ techniques such as EM and co-training.</Paragraph>
    <Paragraph position="8"> We formalize the problem in Section 2, and propose the method in Section 3. We discuss related work in Section 4. Experiments are reported in Section 5, and we conclude in Section 6.</Paragraph>
  </Section>
class="xml-element"></Paper>