<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1706">
  <Title>E-Assessment using Latent Semantic Analysis in the Computer Science Domain: A Pilot Study</Title>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Using LSA for assessment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Types of assessment
</SectionTitle>
      <Paragraph position="0"> Electronic feedback, or e-assessment, is an important component of e-learning. LSA, with its ability to provide immediate, accurate, personalised, and content-based feedback, can be an important component of an e-learning environment.</Paragraph>
      <Paragraph position="1"> Formative assessment provides direction, focus, and guidance concurrent with the learner engaging in some learning process. E-assessment can provide ample help to a learner without requiring added work by a human tutor. A learner can benefit from private, immediate, and convenient feedback.</Paragraph>
      <Paragraph position="2"> Summative assessment, on the other hand, happens at the conclusion of a learning episode or activity. It evaluates a learner's achievement and communicates that achievement to interested parties. Summative assessment using LSA shares the virtues of formative assessment and can produce more objective grading results than those that can occur when many markers are assessing hundreds of student essays.</Paragraph>
      <Paragraph position="3"> The applications described in the next section use LSA to provide formative assessment. Section 4 discusses a pilot study that focuses on summative assessment.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Existing applications
</SectionTitle>
      <Paragraph position="0"> Much work is being done in the area of using LSA to mark essays automatically and to provide content-based feedback. One of the great advantages of automatic assessment of essays is its ability to provide helpful, immediate feedback to the learner without burdening the teacher. This application is particularly suited to distance education, where opportunities for one-on-one tutoring are infrequent or non-existent (Steinhart, 2001). Existing systems include Apex (Lemaire &amp; Dessus, 2001), Autotutor (Wiemer-Hastings, Wiemer-Hastings &amp; Graesser, 1999), Intelligent Essay Assessor (Foltz, Laham &amp; Landauer, 1999), Select-a-Kibitzer (Miller, 2003), and Summary Street (Steinhart, 2001; Wade-Stein &amp; Kintsch, 2003). They differ in details of audience addressed, subject domain, and advanced training required by the system (Miller, 2003). They are similar in that they are LSA-based, web-based, and provide scaffolding, feedback, and unlimited practice opportunities without increasing a teacher's workload (Steinhart, 2001). All of them claim that LSA correlates as well to human markers as human markers correlate to one another. See (Miller, 2003) for an excellent analysis of these systems.</Paragraph>
      <Paragraph position="1"> 4 E-Assessment pilot study Although research using Latent Semantic Analysis (LSA) to assess essays automatically has shown promising results (Chung &amp; O'Neil, 1997; Foltz, et al., 1999; Foltz, 1996; Lemaire &amp; Dessus, 2001; Landauer, et al., 1998; Miller, 2003; Steinhart, 2001; Wade-Stein &amp; Kintsch, 2003), not enough research has been done on using LSA for instructional software (Lemaire &amp; Dessus, 2001).</Paragraph>
      <Paragraph position="2"> Previous studies involved both young students and university-age students, and several different knowledge domains. An open question is how LSA can be used to improve the learning of universityage, computer science students. This section offers three characteristics that distinguish this research from existing research involving the use of LSA to analyse expository writing texts and reports on a pilot study to determine the feasibility of using LSA to mark students' short essay answers to exam questions.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Focuses of the experiment
</SectionTitle>
      <Paragraph position="0"> This subsection describes three facets of the experiment that involve under-researched areas, in the cases of the domain and the type of assessment, and an unsolved research question in the case of the appropriate dimension reduction value for small corpora.</Paragraph>
      <Paragraph position="1"> The study involves essays written by computer science (CS) students. CS, being a technical domain, has a limited, specialist vocabulary. Thus, essays written for CS exams are thought to have a more restricted terminology than do the expository writing texts usually analysed by LSA researchers. Nevertheless, the essays are written in English using a mixture of technical terms and general terms. Will LSA produce valid results? Accuracy is paramount in summative assessment. Whereas formative assessment can be general and informative, summative assessment requires a high degree of precision. Can LSA produce results with a high degree of correlation with human markers? The consensus among LSA researchers, who customarily use very large corpora, is that the number of dimensions that produces the best result is about 300. But because this study involved just 17 graded samples, the number of reduced dimensions has to be less than 17. Can LSA work with many fewer dimensions than 300? A broader question is whether LSA can work with a small corpus in a restricted domain.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 The Data
</SectionTitle>
      <Paragraph position="0"> The data for this experiment consisted of answers from six students to three questions in a single electronic exam held at the Open University in April 2002. The answers are free-form short essays. The training corpus for each question comprised 16 documents consisting of student answers to the same question and a specimen solution. Table 1 gives the average size (in words) of both the student answers graded by LSA and the corpus essays.</Paragraph>
      <Paragraph position="1">  The corpus training documents had been marked previously by three trained human markers. The average marks were assigned to each corpus document. To provide a standard on which to judge the LSA results, each of the answers from the six students was marked by three human markers and awarded the average mark.</Paragraph>
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 The LSA Method
</SectionTitle>
      <Paragraph position="0"> The following steps were taken three times, once for each question on the exam.</Paragraph>
      <Paragraph position="1"> * Determine the words, or terms, in the corpus documents after removing punctuation and stop words. (No attempt has yet been made to deal with synonyms or word forms, such as plurals, via stemming.) * Construct a t x d term frequency matrix M, where t is the number of terms in the corpus and d is the number of documents - 17 in this experiment. Each entry tf ij is the number of times term i appears in document j.  matches the student-answer by comparing a' with the column vectors in B.</Paragraph>
      <Paragraph position="2"> * Award the student-answer the mark associated with the most similar corpus document using the cosine similarity measure.</Paragraph>
    </Section>
    <Section position="6" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.4 Determining the optimum dimension reduction (k)
</SectionTitle>
      <Paragraph position="0"> reduction (k) * This experiment reduced the SVD matrices using k = 2 .. number of corpus documents 1, or k = 2 .. 16. For each value of k, the LSA method produced a mark for each student-answer.</Paragraph>
      <Paragraph position="1"> * The experiment compared the six LSA marks for the student-answers with the corresponding average human mark using Euclidean distance.</Paragraph>
      <Paragraph position="2"> * The experiment revealed that, for this corpus, k = about 10 gave the best matches across the three questions.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4.5 Results
</SectionTitle>
    <Paragraph position="0"> The four graphs below show the results obtained.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.6 Discussion
</SectionTitle>
      <Paragraph position="0"> This experiment investigated the feasibility of using LSA to assess short essay answers. The results shown in Figures 1 - 3 suggest that LSAmarked answers were similar to human-marked answers in 83% (15 of 18) of the answers tested.</Paragraph>
      <Paragraph position="1"> LSA seemed to work well on five of the six student-answers for Question A, all the answers for Question B, and four of the six answers for Question C. For the three clearly incorrect answers, LSA gave a higher score than did the human markers for the answer to question A and one higher mark and one lower mark than did the human markers for the answers to question C.</Paragraph>
      <Paragraph position="2"> To quantify these visual impressions, the study used the Spearman's rho statistical test for each of the three questions. Only one of the three questions shows a statistical correlation between LSA and human marks: question B shows a statistical correlation significant at the 95% level.</Paragraph>
      <Paragraph position="3"> These results, while unacceptable for a real-world application, are encouraging given the extremely small corpus size of only 17 documents, or about 2,000 words for questions A and C and about 600 words for question B. This pilot study solidified our understanding of how to use LSA, the importance of a large corpus, and how to approach further research to improve the results and increase the applicability of the results of this pilot study.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="11" type="metho">
    <SectionTitle>
5 A roadmap for further research
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.1 The corpus
</SectionTitle>
      <Paragraph position="0"> LSA results depend on both corpus size and corpus content.</Paragraph>
      <Paragraph position="1">  Existing LSA research stresses the need for a large corpus. The corpora for the pilot study described in this paper were very small. In addition, the documents are too few in number to be representative of the student population. An ideal corpus would provide documents that give a spread of marks across the mark range and a variety of answers for each mark. Future studies will use a larger corpus.</Paragraph>
      <Paragraph position="2">  Wiemer-Hastings, et. al (1999) report that size is not the only important characteristic of the corpus. Not surprisingly, the composition of the corpus effects the results of essay grading by LSA. In addition to specific documents directly related to their essay questions, Wiemer-Hastings, et. al used more general documents. They found the best composition to be about 40% general documents and 60% specific documents.</Paragraph>
      <Paragraph position="3"> The corpora used for this pilot study comprised only specific documents - the human marked short essays. Future work will involve adding sections of text books to enlarge and enrich the corpus with more general documents.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 Weighting function
</SectionTitle>
      <Paragraph position="0"> The pilot study used local weighting - the most basic form of term weighting. Local weighting is defined as tf ij (the number of times term i is found in document j) dampened by the log function: local</Paragraph>
      <Paragraph position="2"> ). This dampening reflects the fact that a term that appears in a document x times more frequently than another term is not x times more important.</Paragraph>
      <Paragraph position="3"> The study selected this simple weighting function to provide a basis on which to compare more sophisticated functions in future work. Many variations of weighting functions exist; two are described next.</Paragraph>
      <Paragraph position="4">  Dumais (1991) recommended using log-entropy weighting, which is local weighting times global weighting. Global weighting is defined as 1 - the entropy or noise. Global weighting attempts to quantify the fact that a term appearing in many documents is less important than a term appearing in fewer documents.</Paragraph>
      <Paragraph position="5"> The log-entropy term weight for term i in doc j =  that is, the number of documents in Tr in which t k occurs.</Paragraph>
      <Paragraph position="6"> Future studies will examine the effects of applying various term weighting functions.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="11" type="sub_section">
      <SectionTitle>
5.3 Similarity measures
</SectionTitle>
      <Paragraph position="0"> The pilot study used two different similarity measures. It used the cosine measure to compare the test document with the corpus documents. It used Euclidean distance to choose k, the number of reduced dimensions that produced the best results overall. Other measures exist and will be tried in future studies.</Paragraph>
      <Paragraph position="1"> Ljungstrand and Johansson (1998) define the following similarity measures:</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="4" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.4 Corpus pre-processing
</SectionTitle>
      <Paragraph position="0"> Removing stop words is one type of pre-processing performed for this study. Explicitly adding synonym knowledge and stemming are two additional ways of preparing the corpus that future research will consider. Stemming involves conflating word forms to a common string, e.g., write, writing, writes, written, writer would be represented in the corpus as writ.</Paragraph>
    </Section>
    <Section position="5" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.5 Dimension reduction
</SectionTitle>
      <Paragraph position="0"> Choosing the appropriate dimension, k, for reducing the matrices in LSA is a well known open issue. The current consensus is that k should be about 300. No theory yet exists to suggest the appropriate value for k. Currently, researchers determine k by empirically testing various values of k and selecting the best one. The only heuristic says that k &lt;&lt; min(terms, documents). An interesting result from the study reported in this paper is that even though k had to be less than 17, the number of documents in our corpora and thus much less than the recommended value of 300, LSA produced statistically significant results for one of the three questions tested.</Paragraph>
      <Paragraph position="1"> Future studies will continue to investigate the relationship among k, the size of the corpus, the number of documents in the corpus, and the type of documents in the corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="11" end_page="11" type="metho">
    <SectionTitle>
6 Summary
</SectionTitle>
    <Paragraph position="0"> This paper introduced and explained LSA and how it can be used to provide e-assessment by both formative and summative assessment. It provided examples of existing research that uses LSA for eassessment. It reported the results of a pilot study to determine the feasibility of using LSA to assess automatically essays written in the domain of computer science. Although just one of the three essay questions tested showed that LSA marks were statistically correlated to the average of three human marks, the results are promising because the experiment used very small corpora.</Paragraph>
    <Paragraph position="1"> Future studies will attempt to improve the results of LSA by increasing the size of the corpora, improving the content of the corpora, experimenting with different weighting functions and similarity measures, pre-processing the corpus, and using various values of k for dimension reduction.</Paragraph>
  </Section>
class="xml-element"></Paper>