<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0411">
  <Title>Automated Essay Scoring for Nonnative English Speakers</Title>
  <Section position="1" start_page="0" end_page="68" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The e-rater system TM ~ is an operational automated essay scoring system, developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays.</Paragraph>
    <Paragraph position="1"> This paper focuses on results of a study that show e-rater's performance on Test of Written English (TWE) essay responses written by nonnative English speakers whose native language is Chinese, Arabic, or Spanish. In addition, one small sample of the data is from US-born English speakers, and another is from non-US-born candidates who report that their native language is English. As expected, significant differences were found among the scores of the English groups and the nonnative speakers. While there were also differences between e-rater and the human readers for the various language groups, the average agreement rate was as high as operational agreement. At least four of the five features that are included in e-rater's current operational models (including discourse, topical, and syntactic features) also appear in the TWE models. This suggests that the features generalize well over a wide range of linguistic variation, as e-rater was not 1 The e-rater system TM is a trademark of Educational Testing Service. In the paper, we will refer to the e-rater system TM as e-rater. confounded by non-standard English syntactic structures or stylistic discourse structures which one might expect to be a problem for a system designed to evaluate native speaker writing.</Paragraph>
    <Paragraph position="2"> Introduction Research and development in automated essay scoring has begun to flourish in the past five years or so, bringing about a whole new field of interest to the NLP community (Burstein, et al (1998a, 1998b and 1998c), Foltz, et al (1998), Larkey (1998), Page and Peterson (1995)). Research at Educational Testing Service (ETS) has led to the recent development of e-rater, an operational automated essay scoring system. E-rater is based on features in holistic scoring guides for human reader scoring. Scoring guides have a 6-point score scale. Six's are assigned to the &amp;quot;best&amp;quot; essays, and &amp;quot;l's&amp;quot; to the least well-written. Scoring guide criteria are based on structural (syntax and discourse) and vocabulary usage in essay responses (see http://www.gmat.org).</Paragraph>
    <Paragraph position="3"> E-rater builds new models for each topic (prompt-specific models) by evaluating approximately 52 syntactic, discourse and topical analysis variables for 270 human reader scored training essays. Relevant features for each model are based on the predictive feature set identified by a stepwise linear regression. In operational scoring, when compared to a human reader,  e-rater assigns an exactly matching or adjacent score (on the 6-point scale) about 92% of the time. This is the same as the agreement rate typically found between two human readers. Correlations between e-rater scores and those of a single human reader are about .73; correlations between two human readers are .75.</Paragraph>
    <Paragraph position="4"> The scoring guide criteria assume standard written English. Non-standard English may show up in the writing of native English speakers of non-standard dialects. For general NLP research purposes, it is useful to have computer-based corpora that represent language variation (Biber (1993)). Such corpora allow us to explore issues with regard to how the system will handle responses that might be written in non-standard English. Current research at ETS for the Graduate Record Examination (GRE) (Burstein, et al, 1999) is making use of essay corpora that represent subgroups where variations in standard written English might be found, such as in the writing of African Americans, Latinos and Asians (Breland, et al (1995) and Bridgeman and McHale (1996)). In addition, ETS is accumulating essay corpora of nonnative speakers that can be used for research.</Paragraph>
    <Paragraph position="5"> This paper focuses on preliminary data that show e-rater's performance on Test of Written English (TWE) essay responses written by nonnative English speakers whose native language is Chinese, Arabic, or Spanish. A small sample of the data is from US-born English speakers and a second small sample is from non-US-born candidates who report that their native language is English. The data were originally collected for a study by Frase, et al (1997) in which analyses of the essays are also discussed. The current work is only the beginning of a program of research at ETS that will examine automated scoring for nonnative English speakers. Overall goals include determining how features used in automated scoring may also be used to (a) examine the difficulty of an essay question for speakers of particular language groups, and (b) automatically formulate diagnostics and instruction for nonnative English speakers, with customization for different language groups.</Paragraph>
  </Section>
class="xml-element"></Paper>