<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1025"> <Title>Contextual Spelling Correction Using Latent Semantic Analysis</Title> <Section position="5" start_page="166" end_page="167" type="metho"> <SectionTitle> LSA to Tribayes. 3 Latent Semantic Analysis Latent Semantic Analysis (LSA) was developed at </SectionTitle> <Paragraph position="0"> Bellcore for use in information retrieval tasks (for which it is also known as LSI) (Dumais et al., 1988; Deerwester et al., 1990). The premise of the LSA model is that an author begins with some idea or information to be communicated. The selection of particular lexical items in a collection of texts is simply evidence for the underlying ideas or information being presented. The goal of LSA, then, is to take the &quot;evidence&quot; (i.e., words) presented and uncover the underlying semantics of the text passage.</Paragraph> <Paragraph position="1"> Because many words are polysemous (have multiple meanings) and synonymous (have meanings in common with other words), the evidence available in the text tends to be somewhat &quot;noisy.&quot; LSA attempts to eliminate the noise from the data by first representing the texts in a high-dimensional space and then reducing the dimensionality of the space to only the most important dimensions. This process is described in more detail in Dumais (1988) or Deerwester (1990), but a brief description is provided here.</Paragraph> <Paragraph position="2"> A collection of texts is represented in matrix format. The rows of the matrix correspond to terms and the columns represent documents. The individual cell values are based on some function of the term's frequency in the corresponding document and its frequency in the whole collection. The function for selecting cell values will be discussed in section 4.2. A singular value decomposition (SVD) is performed on this matrix. SVD factors the original matrix into the product of three matrices. We'll identify these matrices as T, S, and D'(see Figure 1).</Paragraph> <Paragraph position="3"> The T matrix is a representation of the original term vectors as vectors of derived orthogonal factor values. D' is a similar representation for the original document vectors. S is a diagonal matrix 2 of rank r.</Paragraph> <Paragraph position="4"> It is also called the singular value matrix. The singular values are sorted in decreasing order along the diagonal. They represent a scaling factor for each dimension in the T and D' matrices.</Paragraph> <Paragraph position="5"> Multiplying T, S, and D'together perfectly reproduces the original representation of the text collection. Recall, however, that the original representation is expected to be noisy. What we really want is an approximation of the original space that eliminates the majority of the noise and captures the most important ideas or semantics of the texts.</Paragraph> <Paragraph position="6"> An approximation of the original matrix is created by eliminating some number of the least important singular values in S. They correspond to the least important (and hopefully, most noisy) dimensions in the space. This step leaves a new matrix (So) of rank k. 3 A similar reduction is made in T and D by retaining the first k columns of T and the first k rows of D' as depicted in Figure 2. The product of the resulting To, So, and D'o matrices is a least squares best fit reconstruction of the original matrix (Eckart and Young, 1939). 
<Paragraph position="7"> New text passages can be projected into the space by computing a weighted average of the term vectors which correspond to the words in the new text. In the contextual spelling correction task, we can generate a vector representation for each text passage in which a confusion word appears. The similarity between this text passage vector and the confusion word vectors can be used to predict the most likely word given the context or text in which it will appear.</Paragraph>
</Section>
<Section position="6" start_page="167" end_page="169" type="metho">
<SectionTitle> 4 Experimental Method </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="167" end_page="167" type="sub_section">
<SectionTitle> 4.1 Data </SectionTitle>
<Paragraph position="0"> Separate corpora for training and testing LSA's ability to correct contextual word usage errors were created from the Brown corpus (Kučera and Francis, 1967). The Brown corpus was parsed into individual sentences, which were randomly assigned to either a training corpus or a test corpus. Roughly 80% of the original corpus was assigned as the training corpus and the other 20% was reserved as the test corpus. For each confusion set, only those sentences in the training corpus which contained words in the confusion set were extracted for construction of an LSA space. Similarly, the sentences used to test the LSA space's predictions were those extracted from the test corpus which contained words from the confusion set being examined. The details of the space construction and testing method are described below.</Paragraph>
</Section>
<Section position="2" start_page="167" end_page="167" type="sub_section">
<SectionTitle> 4.2 Training </SectionTitle>
<Paragraph position="0"> Training the system consists of processing the training sentences and constructing an LSA space from them. LSA requires the corpus to be segmented into documents. For a given confusion set, an LSA space is constructed by treating each training sentence as a document. In other words, each training sentence is used as a column in the LSA matrix. Before being processed by LSA, each sentence undergoes the following transformations: context reduction, stemming, bigram creation, and term weighting.</Paragraph>
<Paragraph position="1"> Context reduction is a step in which the sentence is reduced in size to the confusion word plus the seven words on either side of it, or up to the sentence boundary. The average sentence length in the corpus is 28 words, so this step has the effect of reducing the size of the data to approximately half the original. Intuitively, the reduction ought to improve performance by preventing distantly located words in long sentences from having any influence on the prediction of the confusion word, because such words usually have little or nothing to do with the selection of the proper word. In practice, however, the reduction we use had little effect on the predictions obtained from the LSA space.</Paragraph>
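As a rough illustration of the context reduction step just described (this is not the authors' code; the whitespace tokenization, example sentence, and function name are assumptions), a sentence can be reduced to the confusion word plus at most seven words on either side:

    def reduce_context(tokens, confusion_index, window=7):
        # Keep the confusion word plus up to `window` words on either side,
        # truncating at the sentence boundaries.
        start = max(0, confusion_index - window)
        end = min(len(tokens), confusion_index + window + 1)
        return tokens[start:end]

    # Hypothetical example; "accept" is the confusion word.
    tokens = "She said she would accept the nomination only if the committee agreed to revise its rules first".split()
    reduced = reduce_context(tokens, tokens.index("accept"))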
<Paragraph position="2"> We ran some experiments in which we built LSA spaces using the whole sentence as well as other context window sizes. Smaller context sizes did not seem to contain enough information to produce good predictions. Larger context sizes (up to the size of the entire sentence) produced results which were not significantly different from the results reported here. However, using a smaller context size reduces the total number of unique terms by an average of 13%.</Paragraph>
<Paragraph position="3"> Correspondingly, using fewer terms in the initial matrix reduces the average running time and storage space requirements by 17% and 10%, respectively.</Paragraph>
<Paragraph position="4"> Stemming is the process of reducing each word to its morphological root. The goal is to treat the different morphological variants of a word as the same entity. For example, the words smile, smiled, smiles, smiling, and smilingly (all from the corpus) are reduced to the root smile and treated equally. We tried different stemming algorithms, and all of them improved the predictive performance of LSA. The results presented in this paper are based on Porter's algorithm (Porter, 1980).</Paragraph>
<Paragraph position="5"> Bigram creation is performed for the words that were not removed in the context reduction step.</Paragraph>
<Paragraph position="6"> Bigrams are formed between all adjacent pairs of words. The bigrams are treated as additional terms during the LSA space construction process. In other words, each bigram fills its own row in the LSA matrix. Term weighting is an effort to increase the weight or importance of certain terms in the high-dimensional space. A local and a global weighting are given to each term in each sentence. The local weight is a combination of the raw count of the particular term in the sentence and the term's proximity to the confusion word. Terms located nearer to the confusion word are given additional weight in a linearly decreasing manner. The local weight of each term is then flattened by taking its log2.</Paragraph>
<Paragraph position="7"> The global weight given to each term is an attempt to measure its predictive power in the corpus as a whole. We found that entropy (see also (Lochbaum and Streeter, 1989)) performed best as a global measure. Furthermore, terms which did not appear in more than one sentence in the training corpus were removed.</Paragraph>
<Paragraph position="8"> While LSA can be used to quickly obtain satisfactory results, some tuning of the parameters involved can improve its performance. For example, we chose (somewhat arbitrarily) to retain 100 factors for each LSA space. We wanted to fix this variable for all confusion sets, and this number gives good average performance. However, tuning the number of factors to select the &quot;best&quot; number for each space shows an average improvement of 2% over all the results, and up to 8% for some confusion sets.</Paragraph>
</Section>
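The paper does not give explicit formulas for the local and global weights, so the following sketch is only one plausible reading of the description above: a raw count plus a linearly decreasing proximity bonus, flattened with log2, and an entropy-based global weight. The function names, the proximity normalization, and the exact entropy formula are assumptions, not the authors' definitions.

    import math

    def local_weight(term_positions, confusion_pos, window=7):
        # Raw count of the term in the (reduced) sentence plus a bonus that
        # decreases linearly with distance from the confusion word,
        # flattened by taking log2.
        raw = len(term_positions)
        proximity = sum(max(0.0, 1.0 - abs(p - confusion_pos) / (window + 1))
                        for p in term_positions)
        return math.log2(1.0 + raw + proximity)

    def global_weight(counts_per_sentence):
        # A common entropy-based global weight: 1 minus the normalized entropy
        # of the term's distribution over the training sentences. Terms spread
        # evenly over many sentences receive weights near 0.
        total = sum(counts_per_sentence)
        n = len(counts_per_sentence)
        if total == 0 or n <= 1:
            return 1.0
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts_per_sentence if c > 0)
        return 1.0 - entropy / math.log2(n)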
<Section position="3" start_page="167" end_page="169" type="sub_section">
<SectionTitle> 4.3 Testing </SectionTitle>
<Paragraph position="0"> Once the LSA space for a confusion set has been created, it can be used to predict the word (from the confusion set) most likely to appear in a given sentence. We tested the predictive accuracy of the LSA space in the following manner. A sentence from the test corpus is selected, and the location of the confusion word in the sentence is treated as an unknown word which must be predicted. One at a time, the words from the confusion set are inserted into the sentence at the location of the word to be predicted, and the same transformations that the training sentences undergo are applied to the test sentence. The inserted confusion word is then removed from the sentence (but not the bigrams of which it is a part) because its presence biases the comparison which occurs later. A vector in LSA space is constructed from the resulting terms.</Paragraph>
<Paragraph position="1"> The word predicted as most likely to appear in a sentence is determined by comparing the similarity of each test sentence vector to each confusion word vector from the LSA space. Vector similarity is evaluated by computing the cosine between two vectors.</Paragraph>
<Paragraph position="2"> The pair of sentence and confusion word vectors with the largest cosine is identified, and the corresponding confusion word is chosen as the most likely word for the test sentence. The predicted word is compared to the correct word, and a tally of correct predictions is kept.</Paragraph>
</Section>
</Section>
</Paper>
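To make the prediction step in section 4.3 concrete, here is a brief sketch; it is an illustration rather than the authors' implementation, and it assumes the projection of each candidate test sentence into the space has already produced the vectors. The candidate whose sentence vector has the largest cosine with its confusion word vector is chosen.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two vectors; 0.0 if either is the zero vector.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0.0 else float(np.dot(a, b) / denom)

    def predict(sentence_vectors, word_vectors):
        # sentence_vectors: one vector per candidate, for the test sentence with
        # that candidate inserted; word_vectors: each confusion word's LSA vector.
        # Both are dicts keyed by the confusion word.
        return max(word_vectors,
                   key=lambda w: cosine(sentence_vectors[w], word_vectors[w]))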