<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2030">
  <Title>A Hybrid Approach to Content Analysis for Automatic Essay Grading</Title>
  <Section position="4" start_page="0" end_page="0" type="concl">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> We conducted an evaluation to compare the effectiveness of CarmelTC at analyzing student essays in comparison to LSA, Rainbow, and a purely symbolic approach similar to (Furnkranz et al., 1998), which we refer to here as CarmelTCsymb. CarmelTCsymb is identical to CarmelTC except that it does not include in its feature set the prediction from Rainbow. We conducted our evaluation over a corpus of 126 previouslyunseen student essays in response to the Pumpkin Problem described above, with a total of 500 text segments, and just under 6000 words altogether. Each text segment was hand tagged by at least two coders, and conflicts were resolved at a consensus meeting. Pairwise Kappas between our three coders computed over initial codings of our data was always above .75.</Paragraph>
    <Paragraph position="1"> The LSA space used for this evaluation was trained over three first year physics text books. The Rainbow models used to generate the Rainbow predictions that are part of the feature set provided to CarmelTC were trained over a development corpus of 248 hand tagged example sentences extracted from a corpus of human-human tutoring dialogues, just like those included in the 126 essays mentioned above. However, when we evaluated the performance of Rainbow for comparison with CarmelTC, LSA, and the symbolic approach, we ran a 50 fold cross validation evaluation using the complete set of examples in both sets (i.e., the 248 sentences used to train the Rainbow models used to by CarmelTC as well as the 126 essays) so that Rainbow would have access to the exact same training data as CarmelTC, to make it a fair comparison between alternative machine learning approaches. On each iteration, we randomly selected a subset of essays such that the number of text segments included in the test set were greater than 10 but less than 15 and then training Rainbow using the remaining text segments. Thus, CarmelTC uses the same set of training data, but unlike the other approaches, it uses its training data in two separate parts, namely one to train the Rainbow models it uses to produce the Rainbow prediction that is part of the vector representation it builds for each text and one to train the decision trees. This is because for CarmelTC, the data for training Rainbow must be separate from that used to train the decision trees so the decision trees are trained from a realistic distribution of assigned Rainbow classes based on its performance on unseen data rather than on  Rainbow's training data. Thus, for CarmelTC, we also performed a 50 fold cross validation, but this time only over the set of 126 example essays not used to train the Rainbow models used by CarmelTC.</Paragraph>
    <Paragraph position="2"> Note that LSA works by using its trained LSA space to construct a vector representation for any text based on the set of words included therein. It can thus be used for text classification by comparing the vector obtained for a set of exemplar texts for each class with that obtained from the text to be classified. We tested LSA using as exemplars the same set of examples used as Rainbow training data, but it always performed better when using a small set of hand picked exemplars. Thus, we present results here using only those hand picked exemplars. For every approach except LSA, we first segmented the essays at sentence boundaries and classified each sentence separately. However, for LSA, rather than classify each segment separately, we compared the LSA vector for the entire essay to the exemplars for each class (other than &amp;quot;nothing&amp;quot;), since LSA's performance is better with longer texts. We verified that LSA also performed better specifically on our task under these circumstances. Thus, we compared each essay to each exemplar, and we counted LSA as identifying the corresponding &amp;quot;correct answer aspect&amp;quot; if the cosine value obtained by comparing the two vectors was above a threshold. We used a threshold value of .53, which we determined experimentally to achieve the optimal f-score result, using a beta value of 1 in order to treat precision and recall as equally important.</Paragraph>
    <Paragraph position="3"> Figure 1 demonstrates that CarmelTC out performs the other approaches, achieving the highest f-score, which combines the precision and recall scores into a single measure. Thus, it performs better at this task than two commonly used purely &amp;quot;bag of words&amp;quot; approaches as well as to an otherwise equivalent purely symbolic approach.</Paragraph>
  </Section>
</Paper>