<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1706"> <Title>E-Assessment using Latent Semantic Analysis in the Computer Science Domain: A Pilot Study</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper describes a pilot study undertaken to investigate the feasibility of using Latent Semantic Analysis (LSA) for automatic marking of short essays in the domain of computer science. These short essays are free-form answers to exam questions - not multiple-choice questions (MCQs).</Paragraph> <Paragraph position="1"> Exams in the form of MCQs, although easy to mark, do not provide the opportunity for the deeper assessment made possible with essays.</Paragraph> <Paragraph position="2"> This study employs LSA in several areas that are under-researched. First, it uses very small corpora - fewer than 2,000 words, compared with about 11 million words in one of the existing, successful applications (Wade-Stein & Kintsch, 2003).</Paragraph> <Paragraph position="3"> Second, it involves the specific, technical domain of computer science. LSA research usually involves more heterogeneous text with a broad vocabulary. Finally, it focuses on summative assessment, where the accuracy of results is paramount. Most LSA research has involved formative assessment, for which more general evaluations are sufficient.</Paragraph> <Paragraph position="4"> The study investigates one of the shortcomings of LSA mentioned by Manning and Schutze (1999, p. 564). They report that LSA has high recall but low precision. The precision declines because of spurious co-occurrences. They claim that LSA does better on heterogeneous text with a broad vocabulary. Computer science is a technical domain with a more homogeneous vocabulary, which may result in fewer spurious co-occurrences. A major question of this research is how LSA will behave when the technique is stretched by applying it to a narrow domain.</Paragraph> <Paragraph position="5"> Section 2 gives the history of LSA and explains how it works. Section 3 describes several existing LSA applications related to e-assessment. Section 4 provides the motivation for more LSA research and reports on a pilot study undertaken to assess the feasibility of using LSA for automatic marking of short essays in the domain of computer science. Section 5 lists several open issues and areas for improvement that future studies will address.</Paragraph> <Paragraph position="6"> Finally, Section 6 summarises the paper.</Paragraph> <Paragraph position="7"> 2 What is Latent Semantic Analysis? &quot;Latent Semantic Analysis is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text&quot; (Landauer, Foltz & Laham, 1998). It is a statistical natural language processing (NLP) method for inferring meaning from a text. It was developed in the late 1980s by researchers at Bellcore as an information retrieval technique (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990). The earliest application of LSA was Latent Semantic Indexing (LSI) (Furnas et al., 1988; Deerwester et al., 1990). LSI provided an advantage over keyword-based methods in that it could induce associative meanings of the query (Foltz, 1996) rather than relying on exact matches.
Landauer and Dumais (1997) promoted LSA as a model for the human acquisition of knowledge.</Paragraph> <Paragraph position="8"> They developed their theory after creating an information retrieval tool and observing unexpected results from its use. (The researchers originally used the term LSI, Latent Semantic Indexing, to refer to the method; the information retrieval community continues to use the term LSI.)</Paragraph> <Paragraph position="9"> They claimed that LSA solves Plato's problem, that is, how do people learn so much when presented with so little? Their answer is the inductive process: LSA &quot;induces global knowledge indirectly from local co-occurrence data in a large body of representative text&quot; (Landauer & Dumais, 1997).</Paragraph> <Paragraph position="10"> From its original application in information retrieval, LSA has evolved into systems that more fully exploit its ability to extract and represent meaning. Recent applications based on LSA compare a sample text with a pre-existing, very large corpus to judge the meaning of the sample.</Paragraph> <Paragraph position="11"> To use LSA, researchers amass a suitable corpus of text. They create a term-by-document matrix where the columns are documents and the rows are terms (Deerwester et al., 1990). A term is a subdivision of a document; it can be a word, phrase, or some other unit. A document can be a sentence, a paragraph, a textbook, or some other unit. In other words, documents contain terms. The elements of the matrix are weighted word counts of how many times each term appears in each document. More formally, each element a_ij of the matrix is the weighted count of term i in document j.</Paragraph> <Paragraph position="12"> LSA decomposes the matrix into three matrices using Singular Value Decomposition (SVD), a well-known technique (Miller, 2003) that is the general case of factor analysis. Deerwester et al. (1990) describe the process as follows.</Paragraph> <Paragraph position="13"> Let t = the number of terms, or rows; d = the number of documents, or columns; and X = a t by d matrix. Then, after applying SVD, X = TSD, where m = the number of dimensions, m <= min(t, d); T = a t by m matrix; S = an m by m diagonal matrix, i.e., only the diagonal entries have non-zero values; and D = an m by d matrix.</Paragraph> <Paragraph position="15"> LSA reduces S, the diagonal matrix created by SVD, to an appropriate number of dimensions k, where k << m, resulting in S'. The product TS'D is the least-squares best fit to X, the original matrix (Deerwester et al., 1990).</Paragraph> <Paragraph position="16"> The literature often describes LSA as analyzing co-occurring terms. Landauer and Dumais (1997) argue that it does more and explain that the new matrix captures the &quot;latent transitivity relations&quot; among the terms. Terms not appearing in an original document are represented in the new matrix as if they actually were in the original document (Landauer & Dumais, 1997). LSA's ability to induce transitive meanings is considered especially important given that Furnas et al. (1982) report that fewer than 20% of paired individuals will use the same term to refer to the same common concept.</Paragraph>
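To make the matrix-building and SVD steps above concrete, the following Python sketch builds a tiny term-by-document matrix, decomposes it, and keeps only the k largest singular values. It is an illustration under stated assumptions rather than code from the study: the three-document corpus is invented, raw counts stand in for the weighted counts described above, and k = 2 is chosen only because the example is so small.

    # Illustrative LSA pipeline: term-by-document matrix, SVD, rank-k reduction.
    import numpy as np

    documents = [
        "a stack is a last in first out structure",
        "a queue is a first in first out structure",
        "binary trees store keys in sorted order",
    ]

    # Term-by-document matrix X: rows are terms, columns are documents;
    # element a_ij is the (here unweighted) count of term i in document j.
    terms = sorted({word for doc in documents for word in doc.split()})
    X = np.array([[doc.split().count(term) for doc in documents] for term in terms],
                 dtype=float)

    # SVD: X = T S D, with S returned as a vector of singular values.
    T, S, D = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest singular values (k << m) and rebuild the
    # least-squares best rank-k approximation of X.
    k = 2
    X_k = T[:, :k] @ np.diag(S[:k]) @ D[:k, :]

A real application would use a far larger corpus, apply a weighting scheme to the counts before the decomposition, and choose k empirically.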
<Paragraph position="17"> LSA exploits what can be named the transitive property of semantic relationships: if A - B and B - C, then A - C (where - stands for &quot;is semantically related to&quot;). However, the similarity to the transitive property of equality is not perfect. Two words widely separated in the transitivity chain can have a weaker relationship than closer words. For example, LSA might find that copy - duplicate - double - twin - sibling. Copy and duplicate are much closer semantically than copy and sibling.</Paragraph> <Paragraph position="18"> Finding the correct number of dimensions for the new matrix created by SVD is critical; if it is too small, the structure of the data is not captured. Conversely, if it is too large, sampling error and unimportant details remain, e.g., grammatical variants (Deerwester et al., 1990; Miller, 2003; Wade-Stein & Kintsch, 2003). Empirical work involving very large corpora shows the correct number of dimensions to be about 300 (Landauer & Dumais, 1997; Wade-Stein & Kintsch, 2003).</Paragraph> <Paragraph position="19"> Creating the matrices using SVD and reducing the number of dimensions, often referred to as training the system, requires a lot of computing power; it can take hours or days to complete the processing (Miller, 2003). Fortunately, once the training is complete, it takes just seconds for LSA to evaluate a text sample (Miller, 2003).</Paragraph>
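Continuing the sketch above (and again as an assumed illustration, not the study's actual marking procedure), the fragment below shows the kind of quick evaluation the previous paragraph describes: a new text sample is folded into the already-trained reduced space and compared with a reference answer by cosine similarity. The fold_in and cosine helpers are hypothetical names introduced here, and the snippet reuses terms, T, S, and k from the earlier sketch.

    # Fold a new text sample into the reduced space and score it against a
    # reference answer; an assumed scheme for illustration only.
    def fold_in(text, terms, T_k, S_k):
        """Project a raw term-count vector into the k-dimensional LSA space."""
        # Words not in the training vocabulary are simply ignored.
        counts = np.array([text.split().count(term) for term in terms], dtype=float)
        return counts @ T_k @ np.linalg.inv(np.diag(S_k))

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    T_k, S_k = T[:, :k], S[:k]
    reference = fold_in("a queue is a first in first out structure", terms, T_k, S_k)
    answer = fold_in("a queue stores items in first in first out order", terms, T_k, S_k)
    print(cosine(reference, answer))  # closer to 1.0 means closer in meaning

Because the expensive SVD has already been computed, folding in and scoring a new sample involves only small matrix-vector products, which is why evaluation is fast once training is complete.
</Section> </Paper>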