<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0405">
  <Title>Modeling the language assessment process and result: Proposed architecture for automatic oral proficiency assessment</Title>
  <Section position="5" start_page="24" end_page="25" type="metho">
    <SectionTitle>
3 Domain: The Computerized Oral
Proficiency Instrument
</SectionTitle>
    <Paragraph position="0"> The Center for Applied Linguistics in Washington, D.C. (CAL) has developed or assisted in developing simulated oral proficiency interview (SOPI) tests for a variety of languages, recently adapting them to a computer-administered format, the COPI. Scoring at present is done entirely by human raters. The Spanish version of the COPI is in the beta-test phase; Chinese and Arabic versions are under development. All focus on assessing proficiency at the Intermediate Low level, defined by the American Council on the Teaching of Foreign Languages (ACTFL) Speaking Proficiency Guidelines (ACT, 1986), a common standard for passing at many high schools.</Paragraph>
    <Section position="1" start_page="25" end_page="25" type="sub_section">
      <Paragraph position="0"> We focus on Spanish, since we will have access to real data. Our goal is to develop a system with high interannotator agreement with human raters, such that it can replace one of the two or three raters required for oral proficiency interview scoring.</Paragraph>
      <Paragraph position="1"> With respect to non-acoustic features, our domain is tractable for current natural language processing techniques, since the input is expected to be (at best) sentences, perhaps only phrases and words at Intermediate Low.</Paragraph>
      <Paragraph position="2"> Although the tasks and the language at this level are relatively simple, the domain varies enough to be interesting from a research standpoint: enumerating items in a picture, leaving an answering machine message, requesting a car rental, giving a sequence of directions, and describing one's family, among others. These tasks elicit a more varied, though still topically constrained, vocabulary. They also allow the assessment of the speaker's grasp of target language syntax and, in the more advanced tasks, discourse structure and transitions. The COPI, therefore, provides a natural domain for rating non-native speech on both acoustic and non-acoustic features. These subsystems differ in terms of how amenable they are to machine modeling of the human process, as outlined below.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="25" end_page="26" type="metho">
    <SectionTitle>
4 Acoustic Features: The Speech
Recognition Process
</SectionTitle>
      <Paragraph position="0"> In the last two decades significant advances have been made in the field of automatic speech recognition (SR), in both commercial and research domains. Recently, research interest in recognizing non-native speech has increased, providing direct comparisons of recognition accuracy for non-native speakers at different levels of proficiency (Byrne et al., 1998). Tomokiyo (p.c.), in experiments with the JANUS speech system (Waibel et al., 1992) developed by Carnegie Mellon University, reports that systems with recognition accuracies of 85% for native speech perform at 40-50% for high-fluency L2 learners (German) and 30% for medium-fluency speech (Japanese).</Paragraph>
      <Paragraph position="1"> However, current speech recognition technology makes little or no effort to model the human auditory or speech understanding process. Furthermore, standard SR approaches to speaker adaptation rely on relatively large amounts (20-30 minutes) of fixed, recorded speech (Jecker, 1998) to modify the underlying model, say in the case of accented speech, again unlike human listeners.</Paragraph>
      <Paragraph position="2"> While a complete reengineering of speech recognition is beyond the scope of our current project, we do attempt to model the human assessor's approach to understanding non-native speech. The SR system allows us two points of access through which linguistic knowledge of L2 phonology and grammar can be applied to improve recognition: the lexicon and the speech recognizer grammar.</Paragraph>
    <Section position="2" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
4.1 Lexicon: Transfer-model-based
Phonological Adaptation
</SectionTitle>
      <Paragraph position="0"> Since we have too little data for conventional speaker adaptation (less than 5 minutes of speech per examinee), we require a principled way of adapting an L1 or L2 recognizer model to a non-native learner's speech that places less reliance upon recorded training data. We know that pronunciation reflects native language influence, most notably at the early stages (Novice and Intermediate) with which we are primarily concerned. Following the L2 transfer model of acquisition, we assume that the L2 speaker, in attempting to produce target language (TL) utterances, will be influenced by L1 phonology and phonotactics. Thus, rather than being random divergences from TL pronunciation, errors should be closer to L1 phonetic realizations.</Paragraph>
      <Paragraph position="1"> To model these constraints, we will employ two distinct speech recognizers for L2 speech, produced by adaptations specific to the TL and to the source language (SL). We propose to use language identification technology to arbitrate between the two sets of recognizer results, based on a sample of speech: either counting to 20 or a short read text. Since we need to choose between an underlying TL phonological model and one based on the SL, we will make the selection based on the language identification decision as to the apparent phonological identity of the sample as SL or TL, drawing on the sample's phonological and acoustic features (Berkling and Barnard, 1994; Hazen and Zue, 1994; Kadambe and Hieronymus, 1994; Muthusamy, 1993). Parameterizing phonetic expectation based on a short sample of speech (Ladefoged and Broadbent, 1957) or on expectations in context (Ohala and Feder, 1994) mirrors what people do in speech processing generally, independent of the rating context.</Paragraph>
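      <Paragraph> A minimal sketch of this arbitration step, assuming hypothetical log_likelihood and transcribe interfaces for the language-identification models and the TL- and SL-adapted recognizers; the paper does not prescribe an implementation:
```python
def arbitrate(sample_audio, tl_id_model, sl_id_model, tl_recognizer, sl_recognizer):
    """Choose the TL- or SL-adapted recognizer for an examinee, based on a
    language-identification decision over a short calibration sample
    (counting to 20, or a short read text)."""
    tl_score = tl_id_model.log_likelihood(sample_audio)  # fit to TL phonology
    sl_score = sl_id_model.log_likelihood(sample_audio)  # fit to SL phonology
    if tl_score >= sl_score:
        return "TL", tl_recognizer.transcribe(sample_audio)
    return "SL", sl_recognizer.transcribe(sample_audio)
```
</Paragraph>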
    </Section>
    <Section position="3" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
4.2 An acoustic grammar: Modeling the process
</SectionTitle>
      <Paragraph position="0"> Modeling the grammar of a Novice or Intermediate level L2 speaker for use by a speech recognizer is a challenging task. As noted in the ACTFL guidelines, these speakers are frequently inaccurate. However, to use the content of the speech in the assessment, we need to model human raters, who recognize even errorful speech as accurately and completely as possible. Speech recognizers work most effectively when perplexity is low, as is the case when the grammar and vocabulary are highly constrained. However, speech recognizers also recognize what they are told to expect, often accepting and misrecognizing utterances when presented with out-of-vocabulary or out-of-grammar input. We must balance these conflicting demands.</Paragraph>
      <Paragraph position="1"> We will take advantage of the fact that this task is being performed off-line and thus can tolerate recognizer speeds several times real-time. We propose a multi-pass recognition process with step-wise relaxation of grammatical constraints. For example, a relaxed grammar specifies a noun phrase with determiner and optional adjective phrase but relaxes the target language restrictions on gender and number agreement among determiner, noun, and adjective and on the position of the adjective. Similar relaxations can be applied to other major constructions, such as verbs and verbal conjugations, to pass, without "correcting", utterances with small target language inaccuracies. For examinees who do not reach such a level, and for tasks in which sentence-level structure is not expected, we must relax the grammar still further, relying on rejection by the first-pass grammar to choose grammars appropriately. Successive relaxation of the grammar model will allow us to balance the need to reduce perplexity as much as possible against the need to avoid over-predicting and thereby correcting the learner's speech.</Paragraph>
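      <Paragraph> A sketch of the multi-pass strategy, assuming a hypothetical recognize(audio, grammar) call that returns a result carrying a confidence score, and a non-empty list of grammars ordered from strictest to most relaxed:
```python
def multi_pass_recognize(audio, grammars, reject_threshold=0.5):
    """Step-wise relaxation: accept the first pass whose recognition
    confidence clears the rejection threshold; otherwise fall back to the
    most relaxed grammar.  `grammars` runs from the strict target language
    grammar, through agreement/word-order relaxations, down to phrase- and
    word-level grammars."""
    result = None
    for grammar in grammars:
        result = recognize(audio, grammar)  # hypothetical recognizer call
        if result.confidence >= reject_threshold:
            return result
    return result
```
</Paragraph>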
    </Section>
    <Section position="4" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
4.3 Acoustic features: Modeling the result
</SectionTitle>
      <Paragraph position="0"> Research in the area of pronunciation scoring (Rypa, 1996; Franco et al., 1997; Ehsani and Knodt, 1998) has developed both direct and indirect measures of speech quality and pronunciation accuracy, none of which seem to model human raters at any level. The direct measures include calculations of phoneme error rate, computed as divergence from native speaker model standards, and the number of incorrectly pronounced phonemes. The indirect measures attempt to capture some notion of fluency and include speaking rate, number and length of pauses or silences, and total utterance length. Analogous measures should prove useful in the current assessment of spoken proficiency. In addition, one could include, as a baseline, a human-scored measure of perceived accent or lack of fluency. A final measure of acoustic quality could be taken from the language identification process used in the arbitration phase: whether the utterance was more characteristic of the source or the target language. In our samples of passing Intermediate Low speech we find, for example, high ratios of silence to speech both between and within sentences.</Paragraph>
      <Paragraph position="1"> Some sentences are more than 50% silence.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="26" end_page="28" type="metho">
    <SectionTitle>
5 Natural Language Understanding:
Linguistic Features Assessment
</SectionTitle>
    <Paragraph position="0"> For the non-acoustic features, we have a fairly explicit notion of generative competence and a reasonable way of encoding syntax in terms of Context-Free Grammars (CFGs) and semantics via Lexical Conceptual Structures (LCSs). We do not know, however, the relative importance of different aspects of this competence in determining reached/not reached decisions for particular levels in an assessment task. Therefore, we apply machine learning techniques to pool the human-identified features, generating a machine-based model of the process which is fully explicit and amenable to evaluation.</Paragraph>
    <Paragraph position="1"> The e-rater system, deployed by the Educational Testing Service (ETS), incorporates more than 60 variables based on properties used by human raters and divided into syntactic, rhetorical, and topical content categories. Although the features deal with suprasentential structure, the reported variables (Burstein et al., 1998) are identified via lexical information and shallow constituent parsing, arguably not modeling the human process.</Paragraph>
    <Paragraph position="2"> We attempt to model the features based on a deeper analysis of the structure of the text at various levels. We propose to parallel the architecture of the Military Language Tutoring system (MILT), developed jointly by the University of Maryland and Micro Analysis and Design corporation under Army sponsorship. MILT provides a robust model of errors from English speakers learning Spanish and Arabic, identifying lexical and syntactic characteristics of short texts, as well as low-level semantic features, a prerequisite for more sophisticated inferencing (Dorr et al., 1995; Weinberg et al., 1995). At a minimum, the system will provide linguistically principled feedback on errors of various types, rather than producing system error messages or crashing on imperfect input.</Paragraph>
    <Paragraph position="3"> Our work with MILT and the COPI beta-test data suggests that relevant features may be found in each of four main areas of spoken language processing: acoustic, lexical, syntactic/semantic, and discourse. In order to automate the assessment stage of the oral proficiency exam, we must identify features of the L2 examinees' utterances that are correlated with different ratings and that can be extracted automatically. If we divide language into separate components, we can describe a wide range of variation within a bounded set of parameters within those components. We can therefore build a cross-linguistically valid meta-interpreter with the properties we desire (compactness, robustness, and extensibility). This makes both engineering and linguistic sense.</Paragraph>
    <Paragraph position="4"> Our system treats the constraints as submodules that can be turned on or off at the instructor's choice, based on, e.g., what is learned early and the level of correction desired. The MILT-style architecture allows us to make use of the University of Maryland's other parsing and lexicon resources, including large-scale lexica in Spanish and English.</Paragraph>
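    <Paragraph> A sketch of how such constraint submodules might be exposed as instructor-controlled toggles; the module names are invented for illustration:
```python
# Hypothetical constraint toggles; each maps to a relaxable check in the grammar.
DEFAULT_CONSTRAINTS = {
    "gender_agreement": True,     # det-noun-adj gender match
    "number_agreement": True,     # det-noun-adj number match
    "adjective_position": False,  # relaxed: accept pre-nominal adjectives
    "verb_conjugation": False,    # relaxed: accept unconjugated forms
}

def active_constraints(overrides=None):
    """Merge instructor overrides into the defaults and return active checks."""
    config = {**DEFAULT_CONSTRAINTS, **(overrides or {})}
    return {name for name, on in config.items() if on}
```
</Paragraph>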
    <Section position="1" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
5.1 Lexical features
</SectionTitle>
      <Paragraph position="0"> One would expect command of vocabulary to be a natural component of a language learner's proficiency in the target language. A variety of automatically extractable measures provide candidate features for assessing the examinee's lexical proficiency. In addition, the structure of the tasks in the examination allows for testing the extent of lexical knowledge on restricted common topics. For instance, the student may be asked to count to twenty in the target language or to enumerate the items in a pictured context, such as a classroom scene. Within these tasks one can test for the presence and number of specific desired vocabulary items, yielding another measure of lexical knowledge.</Paragraph>
      <Paragraph position="1"> Simple measures with numerical values include number of words in the speech sample and number of distinct words. In addition, examinees at this level frequently rely on vocabulary items from English in their answers.</Paragraph>
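      <Paragraph> A sketch of these simple lexical measures: token count, type count, presence of task-specific target vocabulary, and reliance on English items; the word lists are illustrative assumptions:
```python
def lexical_features(tokens, target_vocab, english_lexicon):
    """Count tokens, types, desired TL items, and English (L1) items."""
    types = {t.lower() for t in tokens}
    return {
        "num_tokens": len(tokens),
        "num_types": len(types),
        "target_items_used": len(types & target_vocab),
        "english_items_used": len(types & english_lexicon),
    }

# Classroom-scene enumeration task (hypothetical word lists):
feats = lexical_features(
    ["la", "mesa", "el", "libro", "book"],
    target_vocab={"mesa", "libro", "silla"},
    english_lexicon={"book", "chair", "desk"},
)
# feats["target_items_used"] == 2; feats["english_items_used"] == 1
```
</Paragraph>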
      <Paragraph position="2"> A deeper type of knowledge may be captured by the lexicon in Lexical Conceptual Structure (LCS) (Dorr, 1993b; Dorr, 1993a; Jackendoff, 1983). The LCS is an interlingual framework for representing semantic elements that have syntactic reflexes. LCSs have been ported from English into a variety of languages, including Spanish, requiring a minimum of adaptation even in unrelated languages (e.g., Chinese (Olsen et al., 1998)). The representation indicates the argument-taking properties of verbs (hit requires an object; smile does not), selectional constraints (the subject of fear and the object of frighten are animate), thematic information of arguments (the subject of frighten is an agent; the object is a patient), and classification information of verbs (motion verbs like go are conceptually distinct from psychological verbs like fear/frighten; run is a more specific type of motion verb than go). Each information type is modularly represented and therefore may be separately analyzed and scored.</Paragraph>
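      <Paragraph> A minimal sketch of the verb-entry information types just listed, as a plain data structure; the actual LCS formalism (Dorr, 1993a; Dorr, 1993b) is far richer, and the attribute names here are invented:
```python
from dataclasses import dataclass

@dataclass
class LCSVerbEntry:
    verb: str
    primitive: str                   # e.g., GO, STAY, BE
    semantic_field: str = "spatial"  # spatial, psychological, ...
    takes_object: bool = False       # hit requires one; smile does not
    subject_animate: bool = False    # selectional constraint (fear)
    subject_role: str = "agent"      # thematic information
    parent: str = ""                 # hierarchy: run is a specific kind of go

# Each information type is separately represented, so each can be scored alone.
HIT = LCSVerbEntry("hit", primitive="GO", takes_object=True)
RUN = LCSVerbEntry("run", primitive="GO", parent="go")
FEAR = LCSVerbEntry("fear", primitive="BE", semantic_field="psychological",
                    takes_object=True, subject_animate=True,
                    subject_role="experiencer")
```
</Paragraph>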
    </Section>
    <Section position="2" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
5.2 Syntactic features
</SectionTitle>
      <Paragraph position="0"> We adopt a generative approach to grammar, following Government and Binding, Principles and Parameters, and Minimalism. In these models, differences in the surface structure of languages can be reduced to a small number of modules and parameters. For example, although Spanish and English both have subject-verb-object (SVO) word order, the relative ordering of many nouns and adjectives differs (the "head parameter"; other parameters deal with case, theta-role assignment, binding, and bounding). In English the adjective precedes the noun, whereas Spanish adjectives of nationality, color, and shape regularly follow the noun (Whitley, 1986, pp. 241-2). The MILT architecture allows us both to enumerate errors of these types and to parse data that includes such errors. We will also consider measures of the number and form of distinct construction types, both attempted and correctly completed. Such constructions could include simple declaratives (subject, verb, and one argument), noun phrases with both determiner and adjective with correct agreement and word order, questions, and multi-clause sentences.</Paragraph>
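      <Paragraph> A sketch of per-module checks for one construction type named above, a Spanish determiner-noun-adjective phrase, scoring agreement and word order separately so errors can be enumerated rather than rejected; the token representation is simplified for illustration:
```python
from dataclasses import dataclass

@dataclass
class Tok:
    form: str
    gender: str  # "m" or "f"
    number: str  # "sg" or "pl"
    index: int   # position in the phrase

def score_np(det, noun, adj):
    """Return one result per grammatical module for a det-noun-adj attempt."""
    return {
        "gender_agreement": det.gender == noun.gender == adj.gender,
        "number_agreement": det.number == noun.number == adj.number,
        "adjective_position": adj.index > noun.index,  # Spanish: adj follows noun
    }

# "el casa blanco": gender errors, but number and word order are correct.
checks = score_np(Tok("el", "m", "sg", 0), Tok("casa", "f", "sg", 1),
                  Tok("blanco", "m", "sg", 2))
# {"gender_agreement": False, "number_agreement": True, "adjective_position": True}
```
</Paragraph>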
    </Section>
    <Section position="3" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.3 Semantic features
</SectionTitle>
      <Paragraph position="0"> Like the syntactic information, lexical information can be used modularly to assess a variety of properties of examinees' speech. The Lexical Conceptual Structure (LCS) allows principles of robustness, flexibility, and modularity to apply to the semantic component of the proposed system. The LCS serves as the basis of several different applications, including machine translation and information retrieval as well as foreign language tutoring. The LCS is considered a subset of mental representation, that is, the language of mental representation as realized in language (Dorr et al., 1995). Event types such as events and states are represented by primitives such as GO, STAY, BE, GO-EXT, and ORIENT, used in spatial and other 'fields'. As such, it allows potential modeling of human rater processes.</Paragraph>
      <Paragraph position="1"> The LCS allows various syntactic forms to have the same semantic representation, e.g., Walk to the table and pick up the book, Go to the table and remove the book, or Retrieve the book from the table. COPI examinees are also expected to express similar information in different ways. We propose to use the LCS structure to handle and potentially enumerate competence in this type of variation.</Paragraph>
      <Paragraph position="2"> Stored LCS representations may also handle hierarchical relations among verbs and divergences in the expression of elements of meaning, sometimes reflecting native language word order. The modularity of the system allows us to tease the semantic and syntactic features apart, giving credit for the semantic expression but identifying divergences from the target L2.</Paragraph>
    </Section>
    <Section position="4" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.4 Discourse features
</SectionTitle>
      <Paragraph position="0"> Since the ability to productively combine words in phrase and sentence structures separates the Intermediate learner from the Novice, features that capture this capability should prove useful in semi-automatic assessment of Intermediate Low proficiency, our target level. According to the ACTFL Speaking Proficiency Guidelines, Intermediate Low examinees begin to compose sentences; full discourses do not emerge until later levels of competence. Nevertheless, we want both to give credit for any discourse-level features that surface and to provide a slot for such features, to allow scalability to more complex tasks and higher levels of competence. We will therefore develop discourse and dialog models with appropriate and measurable characteristics. Many of these can be lexically or syntactically identified, as the ETS GMAT research shows (Burstein et al., 1998). Our data might include uses of discourse connectives (entonces 'then; in that case'; pero 'but'; es que 'the fact is'; cuando 'when'), other subordinating structures (Yo creo que 'I think that'), and use of pronouns instead of repeated nouns. The discourse measures can easily be expanded to cover additional, more advanced constructions that are lexically signaled, such as the use of subordination or paragraph-level structures.</Paragraph>
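      <Paragraph> A sketch of lexically identified discourse measures using the connectives cited above; the inventories are small illustrative subsets, and the substring match for multi-word frames is deliberately crude:
```python
CONNECTIVES = {"entonces", "pero", "cuando"}  # then / but / when
SUBORDINATORS = ["es que", "yo creo que"]     # lexically signaled frames

def discourse_features(tokens):
    """Count discourse connectives and subordinating frames in a transcript."""
    toks = [t.lower() for t in tokens]
    text = " ".join(toks)
    return {
        "num_connectives": sum(1 for t in toks if t in CONNECTIVES),
        "num_subordinators": sum(text.count(s) for s in SUBORDINATORS),
    }

discourse_features("yo creo que es mejor pero no entonces".split())
# {"num_connectives": 2, "num_subordinators": 1}
```
</Paragraph>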
    </Section>
  </Section>
  <Section position="8" start_page="28" end_page="29" type="metho">
    <SectionTitle>
6 Machine learning
</SectionTitle>
    <Paragraph position="0"> While the above features capture some measure of target language speaking proficiency, it is difficult to determine a priori which features or groups of features will be most useful in making an accurate assessment. In this work, ratings from human assessors trained on the Speaking Proficiency Guidelines will be used as the "Gold Standard" for determining the accuracy of automatic assessment. We plan to apply machine learning techniques to determine the relative importance of different feature values in rating a speech sample.</Paragraph>
    <Paragraph position="1"> The assessment phase goes beyond current work in test scoring, combining recognition of acoustic features, as in the Automatic Spoken Language Assessment by Telephone (ASLAT), or PhonePass (Ordinate, 1998), with aspects of the syntactic, discourse, and semantic factors, as in e-rater. Our goal is to have the automatic scoring system mirror the outcome of raters trained in the ACTFL Guidelines (ACT, 1986) in determining whether examinees did or did not reach the Intermediate Low level. We also aim to make the process of feature weighting transparent, so that we can determine whether the system provides an adequate model of the human rating process. We will evaluate quantitatively the extent to which machine classification agrees with human raters on acoustic and non-acoustic properties, both jointly and separately. We will also evaluate the process qualitatively with human raters.</Paragraph>
    <Paragraph position="2"> We plan to exploit the natural structuring of the data features through decision trees or a small hierarchical "mixture-of-experts"-type model (Quinlan, 1993; Jacobs et al., 1991; Jordan and Jacobs, 1992). Intuitively, the latter approach creates experts (machine-learning trained classifiers) for each group of features (acoustic, lexical, and so on). The correct way of combining these experts is then acquired through similar machine learning techniques. The organization of the classifier allows the machine learning technique at each stage of the hierarchy to consider fewer features and thus, due to the branching structure of the tree classifier, dramatically fewer classifier configurations. Decision-tree classifiers have an additional advantage: unlike neural network or nearest-neighbor classifiers, they are easily interpretable by humans. The trees can be rewritten trivially as sequences of if-then rules leading to a certain classification. For instance, in the assessment task, one might hypothesize a rule of the form: IF silence &gt; 20% of utterance, THEN Intermediate Low NOT REACHED. It is thus possible to have human raters analyze how well the rules agree with their own intuitions about scoring and to determine which automatic features play the most important role in assessment.</Paragraph>
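    <Paragraph> A sketch of the interpretable decision-tree stage, assuming a feature matrix whose columns are automatic measures like those above (here silence ratio and type count, with made-up values) and whose labels are the human reached/not-reached ratings; it uses scikit-learn for illustration:
```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [silence_ratio, num_types] per speech sample.
X = [[0.25, 40], [0.55, 12], [0.10, 55], [0.62, 9]]
y = ["reached", "not_reached", "reached", "not_reached"]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The tree prints as if-then rules (e.g., "silence_ratio > 0.40 -> not_reached"),
# which human raters can compare against their own scoring intuitions.
print(export_text(clf, feature_names=["silence_ratio", "num_types"]))
```
</Paragraph>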
  </Section>
</Paper>