<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2017">
  <Title>Computing Term Translation Probabilities with Generalized Latent Semantic Analysis</Title>
  <Section position="3" start_page="0" end_page="152" type="metho">
    <SectionTitle>
2 Term Translation Probabilities in
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="151" type="sub_section">
      <SectionTitle>
Language Modelling
</SectionTitle>
      <Paragraph position="0"> The language modelling approach (Ponte and Croft, 1998) proved very effective for the information retrieval task. This method assumes that every document defines a multinomial probability distribution p(w|d) over the vocabulary space. Thus, given a query q = (q1,...,qm), the likelihood of the query is estimated using the document's distribution: p(q|d) = producttextm1 p(qi|d), where  qi are query terms. Relevant documents maximize p(d|q) [?] p(q|d)p(d).</Paragraph>
      <Paragraph position="1"> Many relevant documents may not contain the same terms as the query. However, they may contain terms that are semantically related to the query terms and thus have high probability of being &amp;quot;translations&amp;quot;, i.e. re-formulations for the query words.</Paragraph>
      <Paragraph position="2"> Berger et. al (Berger and Lafferty, 1999) introduced translation probabilities between words into the document-to-query model as a way of semantic smoothing of the conditional word probabilities. Thus, they query-document similarity is computed as</Paragraph>
      <Paragraph position="4"> Each document word w is a translation of a query term qi with probability t(qi|w). This approach showed improvements over the baseline language modelling approach (Berger and Lafferty, 1999).</Paragraph>
      <Paragraph position="5"> The estimation of the translation probabilities is, however, a difficult task. Lafferty and Zhai used a Markov chain on words and documents to estimate the translation probabilities (Lafferty and Zhai, 2001). We use the Generalized Latent Semantic Analysis to compute the translation probabilities. null</Paragraph>
    </Section>
    <Section position="2" start_page="151" end_page="151" type="sub_section">
      <SectionTitle>
2.1 Document Similarity
</SectionTitle>
      <Paragraph position="0"> We propose to use low dimensional term vectors for inducing the translation probabilities between terms. Wepostpone thediscussion ofhowtheterm vectors are computed to section 2.2. To evaluate the validity of this approach, we applied it to document classification.</Paragraph>
      <Paragraph position="1"> We used two methods of computing the similarity between documents. First, we computed the language modelling score using term translation probabilities. Once the term vectors are computed, the document vectors are generated as linear combinations of term vectors. Therefore, we also used the cosine similarity between the documents to perform classificaiton.</Paragraph>
      <Paragraph position="2"> We computed the language modelling score of a test document d relative to a training document di as</Paragraph>
      <Paragraph position="4"> Appropriately normalized values of the cosine similarity measure between pairs of term vectors cos(vectorv, vectorw) are used as the translation probability between the corresponding terms t(v|w).</Paragraph>
      <Paragraph position="5"> In addition, we used the cosine similarity between the document vectors</Paragraph>
      <Paragraph position="7"> where adiw and bdjv represent the weight of the terms w and v with respect to the documents di and dj, respectively.</Paragraph>
      <Paragraph position="8"> Inthis case, theinner products between the term vectors are also used to compute the similarity between the document vectors. Therefore, the cosine similarity between the document vectors also depends on the relatedness between pairs of terms. We compare these two document similarity scores to the cosine similarity between bag-of-word document vectors. Our experiments show that these two methods offer an advantage for document classification.</Paragraph>
    </Section>
    <Section position="3" start_page="151" end_page="152" type="sub_section">
      <SectionTitle>
2.2 Generalized Latent Semantic Analysis
</SectionTitle>
      <Paragraph position="0"> We use the Generalized Latent Semantic Analysis (GLSA)(Matveeva et al., 2005) to compute semantically motivated term vectors.</Paragraph>
      <Paragraph position="1"> TheGLSAalgorithm computes thetermvectors for the vocabulary of the document collection C with vocabulary V using a large corpus W. It has the following outline:  1. Construct the weighted term document matrix D based on C 2. For the vocabulary words in V, obtain a matrix of pair-wise similarities, S, using the large corpus W 3. Obtain the matrix UT of low dimensional vector space representation of terms that preserves the similarities in S, UT [?] Rkx|V| 4. Compute document vectors by taking linear  combinations of term vectors ^D = UTD The columns of ^D are documents in the k-dimensional space.</Paragraph>
      <Paragraph position="2"> In step 2 we used point-wise mutual information (PMI) as the co-occurrence based measure of semantic associations between pairs of the vocabulary terms. PMI has been successfully applied to semantic proximity tests for words (Turney, 2001; Terra and Clarke, 2003) and was also successfully used as a measure of term similarity to compute document clusters (Pantel and Lin, 2002). In  our preliminary experiments, the GLSA with PMI showed a better performance than with other co-occurrence based measures such as the likelihood ratio, and kh2 test.</Paragraph>
      <Paragraph position="3"> PMI between random variables representing two words, w1 and w2, is computed as</Paragraph>
      <Paragraph position="5"> We used the singular value decomposition (SVD) in step 3 to compute GLSA term vectors.</Paragraph>
      <Paragraph position="6"> LSA (Deerwester et al., 1990) and some other related dimensionality reduction techniques, e.g. Locality Preserving Projections (He and Niyogi, 2003) compute a dual document-term representation. The main advantage of GLSA is that it focuses on term vectors which allows for a greater flexibility in the choice of the similarity matrix.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="152" end_page="153" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> The goal of the experiments was to understand whether the GLSA term vectors can be used to model the term translation probabilities. We used a simple k-NN classifier and a basic baseline to evalute the performance. We used the GLSA-based term translation probabilities within the language modelling framework and GLSA document vectors.</Paragraph>
    <Paragraph position="1"> We used the 20 news groups data set because previous studies showed that the classification performance on this document collection can noticeably benefit from additional semantic information (Bekkerman et al., 2003). For the GLSA computations we used the terms that occurred in at least 15 documents, and had a vocabulary of 9732 terms. We removed documents with fewer than 5 words. Here we used 2 sets of 6 news groups. Groupd contained documents from dissimilar news groups1, with a total of 5300 documents. Groups contained documents from more similar news groups2 and had 4578 documents.</Paragraph>
    <Section position="1" start_page="152" end_page="152" type="sub_section">
      <SectionTitle>
3.1 GLSA Computation
</SectionTitle>
      <Paragraph position="0"> To collect the co-occurrence statistics for the similarities matrix S we used the English Gigaword collection (LDC). We used 1,119,364 New York Times articles labeled &amp;quot;story&amp;quot; with 771,451 terms. 1os.ms, sports.baseball, rec.autos, sci.space, misc.forsale, religion-christian 2politics.misc, politics.mideast, politics.guns, religion.misc, religion.christian, atheism  We used the Lemur toolkit3 to tokenize and index the document; we used stemming and a list of stopwords. Unlessstated otherwise, fortheGLSA methods we report the best performance over different numbers of embedding dimensions.</Paragraph>
      <Paragraph position="1"> Theco-occurrence counts can be obtained using either term co-occurrence within the same document or within a sliding window of certain fixed size. In our experiments we used the window-based approach which was shown to give better results (Terra and Clarke, 2003). We used the window of size 4.</Paragraph>
    </Section>
    <Section position="2" start_page="152" end_page="153" type="sub_section">
      <SectionTitle>
3.2 Classification Experiments
</SectionTitle>
      <Paragraph position="0"> We ran the k-NN classifier with k=5 on ten random splits of training and test sets, with different numbers of training documents. The baseline was to use the cosine similarity between the bag-of-words document vectors weighted with term frequency. Other weighting schemes such as maximum likelihood and Laplace smoothing did not improve results.</Paragraph>
      <Paragraph position="1"> Table 1 shows the results. We computed the score between the training and test documents using two approaches: cosine similarity between the GLSA document vectors according to Equation 3 (denoted as GLSA), and the language modelling score which included the translation probabilities between the terms as in Equation 2 (denoted as  LM). We used the term frequency as an estimate for p(w|d). To compute the matrix of translation probabilities P, where P[i][j] = t(tj|ti) for the LMCLSA approach, we first obtained the matrix ^P[i][j] = cos(vectorti,vectortj). We set the negative and zero entries in ^P to a small positive value. Finally, we normalized the rows of ^P to sum up to one.</Paragraph>
      <Paragraph position="2"> Table 1 shows that for both settings GLSA and LM outperform the tf document vectors. As expected, the classification task was more difficult for the similar news groups. However, in this case both GLSA-based approaches outperform the baseline. In both cases, the advantage is more significant with smaller sizes of the training set. GLSA and LM performance usually peaked at around 300-500 dimensions which is in line with results for other SVD-based approaches (Deerwester et al., 1990). When the highest accuracy was achieved at higher dimensions, the increase after 500 dimensions was rather small, as illustrated in Figure 1.</Paragraph>
      <Paragraph position="3"> These results illustrate that the pair-wise similarities between the GLSA term vectors add important semantic information which helps to go beyond term matching and deal with synonymy and polysemy.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>