<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0203"> <Title>Computer-Aided Generation of Multiple-Choice Tests</Title> <Section position="3" start_page="9" end_page="15" type="intro"> <SectionTitle> 4. Evaluation </SectionTitle> <Paragraph position="0"> In order to validate the efficiency of the method, we evaluated the performance of the system in two different ways. Firstly, we investigated the efficiency of the procedure by measuring the average time needed to produce a test item with the help of the program as opposed to the average time needed to produce a test item manually.</Paragraph> <Paragraph position="1"> Secondly, we examined the quality of the items generated with the help of the program, and compared it with the quality of the items produced manually. The quality was assessed via standard test theory measures such as the discriminating power and difficulty of each test item and the usefulness of each alternative.</Paragraph> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.1 The procedure of generating test items with the help of the program and its efficiency </SectionTitle> <Paragraph position="0"> The first step of the procedure consists of the automatic generation of test items. The items so generated were then either (i) declared as 'worthy' and accepted for direct use without any revision, or further post-edited before being put into use, or (ii) declared as 'unworthy' and discarded. 'Unworthy' items were those that did not focus on a central concept or required too much revision, and so they were rejected.</Paragraph> <Paragraph position="1"> The items selected for further post-editing required minor, fair or major revisions. 'Minor' revision describes minor syntactical post-editing of the test question, including minor operations such as insertions of articles and correction of spelling and punctuation. 'Fair' revision refers to some grammatical post-editing of the test question, including re-ordering or deletion of words and replacement of one distractor at most. 'Major' revision applied to the generated test items involved more substantial grammatical revision of the test question and replacement of two or more of the distractors. The position of the correct answer (in this case 'reflexive pronoun') is generated randomly.</Paragraph> <Paragraph position="2"> Two graduate students in linguistics acted as post-editors. The same students were involved in the production of test items manually. The texts used were selected with care so that possible influence of potentially similar or familiar texts was minimised. See also the discussion in section 5 on the effect of familiarity.</Paragraph> <Paragraph position="3"> As an illustration, the automatically generated test item
(3) Which kind of language unit seem to be the most obvious component of language, and any theory that fails to account for the contribution of words to the functioning of language is unworthy of our attention?
(a) word (b) name (c) syllable (d) morpheme
was not acceptable in this form and required the deletion of the text 'and any theory that fails to account for the contribution of words to the functioning of language is unworthy of our attention', which was classed as 'fair' revision.</Paragraph> <Paragraph position="4"> From a total of about 575 items automatically generated by the program, 57% were deemed to be 'worthy', i.e. considered for further use. From the worthy items, 6% were approved for direct class test use without any post-editing and 94% were subjected to post-editing. From the items selected for revision, 17% needed minor revision, 36% needed fair revision and 47% needed major revision.</Paragraph> <Paragraph position="5"> The time needed to produce 300 test items with the help of the program, including the time necessary to reject items, accept items for further editing or approve them for direct use, amounted to 9 hours. The time needed to manually produce 65 questions was 7 hours and 30 minutes. This results in an average of 1 minute and 48 seconds to produce a test item with the help of the program and an average of 6 minutes and 55 seconds to develop a test item manually (Table 1).</Paragraph> </Section>
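As a quick back-of-the-envelope check of the reported averages (a sketch using only the totals quoted above; the variable names are ours):

    # Totals reported in Section 4.1
    aided_total_min = 9 * 60           # 9 hours for 300 items produced with the program
    manual_total_min = 7 * 60 + 30     # 7 hours 30 minutes for 65 items produced by hand

    aided_per_item = aided_total_min / 300    # 1.8 min, i.e. 1 min 48 s per item
    manual_per_item = manual_total_min / 65   # ~6.92 min, i.e. ~6 min 55 s per item

    print(f"computer-aided: {aided_per_item:.2f} min/item")
    print(f"manual:         {manual_per_item:.2f} min/item")

The same rates also account for the databank projection given in section 5: at roughly 1.8 and 6.9 minutes per item, 1000 items come to about 30 hours with the program and about 115 hours by hand.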
<Section position="2" start_page="10" end_page="14" type="sub_section"> <SectionTitle> 4.2 Analysis of the items generated with the help of the program </SectionTitle> <Paragraph position="0"> Item analysis is an important procedure in classical test theory which provides information as to how well each item has functioned. The item analysis for multiple-choice tests usually consists of the following information (Gronlund 1982): (i) the difficulty of the item, (ii) the discriminating power and (iii) the usefulness of each alternative. This information can tell us whether a specific test item was too easy or too hard, how well it discriminated between high and low scorers on the test, and whether all of the alternatives functioned as intended. Such types of analysis help improve test items or discard defective items.</Paragraph> <Paragraph position="1"> In order to conduct this type of analysis, we used a simplified procedure described in (Gronlund 1982). We arranged the test papers in order from the highest score to the lowest score. We selected one third of the papers and called this the upper group (15 papers). We also selected the same number of papers with the lowest scores and called this the lower group (15 papers). For each item, we counted the number of students in the upper group who selected each alternative; we made the same count for the lower group.</Paragraph> <Paragraph position="2"> (i) Item Difficulty. We estimated the Item Difficulty (ID) by establishing the percentage of students from the two groups who answered the item correctly (ID = C/T x 100, where C is the number who answered the item correctly and T is the total number of students who attempted the item). From the 24 items subjected to analysis, there were 0 too difficult and 3 too easy items.</Paragraph> <Paragraph position="3"> The average item difficulty was 0.75.</Paragraph>
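A minimal sketch of this simplified procedure, assuming each test paper is represented as a (total_score, answers) pair in which answers maps item identifiers to the option the student chose (these structures and names are our own illustration, not part of the described system; note that the formula above yields a percentage, while the reported values such as 0.75 are the corresponding proportions):

    # Simplified item analysis (after Gronlund 1982): rank the papers by total score,
    # take the top and bottom thirds, then compute Item Difficulty per item.
    def split_groups(papers):
        """papers: list of (total_score, answers) pairs."""
        ranked = sorted(papers, key=lambda p: p[0], reverse=True)
        third = len(ranked) // 3                 # 15 papers per group in the class test
        return ranked[:third], ranked[-third:]   # upper group, lower group

    def item_difficulty(item_id, correct_option, upper, lower):
        """ID as a proportion, counting only students from the two groups who attempted the item."""
        choices = [ans[item_id] for _, ans in upper + lower if item_id in ans]
        correct = sum(1 for c in choices if c == correct_option)
        return correct / len(choices)            # multiply by 100 for the percentage form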
<Paragraph position="4"> (ii) Discriminating Power. We estimated the item's Discriminating Power (DP) by comparing the number of students in the upper and lower groups who answered the item correctly. It is desirable that the discrimination is positive, which means that the item differentiates between students in the same way that the total test score does.</Paragraph> <Paragraph position="5"> Originally called 'effectiveness'. We chose to term this type of analysis 'usefulness' to distinguish it from the (cost/time) 'effectiveness' of the (semi-)automatic procedure as opposed to the manual construction of tests.</Paragraph> <Paragraph position="6"> For experimental purposes, we consider an item to be 'too difficult' if ID ≤ 0.15 and an item 'too easy' if ID ≥ 0.85.</Paragraph> <Paragraph position="7"> Zero DP is obtained when an equal number of students in each group respond to the item correctly. On the other hand, negative DP is obtained when more students in the lower group than in the upper group answer correctly. Items with zero or negative DP should be either discarded or improved.</Paragraph> <Paragraph position="8"> DP was computed as DP = (CU - CL) / (T/2), where CU is the number of students in the upper group who answered the item correctly and CL is the number of students in the lower group who did so. Here again T is the total number of students included in the item analysis.</Paragraph> <Paragraph position="9"> The average DP for the set of items used in the class test was 0.40. From the analysed test items, there was only one item that had a negative discrimination.</Paragraph> <Paragraph position="10"> (iii) Usefulness of the distractors. The usefulness of the distractors is estimated by comparing the number of students in the upper and lower groups who selected each incorrect alternative. A good distractor should attract more students from the lower group than from the upper group.</Paragraph> <Paragraph position="11"> The evaluation of the distractors estimated the average difference between students in the lower and upper groups to be 1.92. Distractors classed as poor are those that attract more students from the upper group than from the lower group, and there were 6 such distractors. On the other hand, we term distractors not useful if they are selected by no student. The evaluation showed that there were 3 distractors deemed not useful.</Paragraph> </Section>
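Continuing the earlier sketch (same hypothetical (score, answers) layout; the helper names are ours), discriminating power and distractor usefulness can be computed from the upper- and lower-group counts as follows:

    def discriminating_power(item_id, correct_option, upper, lower):
        """DP = (CU - CL) / (T/2), with T the number of students included in the analysis."""
        c_upper = sum(1 for _, ans in upper if ans.get(item_id) == correct_option)
        c_lower = sum(1 for _, ans in lower if ans.get(item_id) == correct_option)
        t = len(upper) + len(lower)
        return (c_upper - c_lower) / (t / 2)

    def distractor_usefulness(item_id, distractors, upper, lower):
        """Label each distractor: 'poor' if it attracts more upper- than lower-group
        students, 'not useful' if no student selects it, 'ok' otherwise."""
        report = {}
        for d in distractors:
            n_upper = sum(1 for _, ans in upper if ans.get(item_id) == d)
            n_lower = sum(1 for _, ans in lower if ans.get(item_id) == d)
            if n_upper + n_lower == 0:
                label = "not useful"
            elif n_upper > n_lower:
                label = "poor"
            else:
                label = "ok"
            report[d] = (n_lower - n_upper, label)   # these differences feed the 1.92 / 1.18 averages
        return report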
<Section position="3" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 4.3 Analysis of the items constructed manually </SectionTitle> <Paragraph position="0"> An experiment worth pursuing was to conduct item analysis of the manually produced test items and compare the results with those obtained for the items produced with the help of the program. A set of 12 manually produced items was subjected to the above three types of item analysis. There were 0 too difficult and 1 too easy items. The average item difficulty of the items was 0.59. The average discriminating power was assessed to be 0.25 and there were 2 items with negative discrimination. The evaluation of the usefulness of the distractors resulted in an average difference between students in the upper and lower groups of 1.18. There were 10 distractors that attracted more students from the upper group and were therefore declared as poor, and 2 distractors not selected at all, and therefore deemed to be not useful.</Paragraph> <Paragraph position="1"> Maximum positive DP is obtained only when all students in the upper group answer correctly and no one in the lower group does. An item that has a maximum DP (1.0) would have an ID of 0.5; therefore, test authors are advised to construct items at the 0.5 level of difficulty.</Paragraph> <Paragraph position="2"> Table 2 summarises the item analysis results for both the test items produced with the help of the program and those produced by hand.</Paragraph> <Paragraph position="3"> 5. Discussion and plans for future work
The evaluation results clearly show that the construction of multiple-choice test items with the help of the program is much more effective than purely manual construction. We believe that this is the main advantage of the proposed methodology. As an illustration, the development of a test databank of considerable size consisting of 1000 items would require 30 hours of human input when using the program, and 115 hours if done manually. This has direct financial implications, as the time and cost of developing test items would be dramatically cut.</Paragraph> <Paragraph position="4"> At the same time, the test item analysis shows that the quality of test items produced with the help of the program is not compromised in exchange for time and labour savings. The test items produced with the help of the program were evaluated as being of very satisfactory quality. As a matter of fact, in many cases they scored even better than those manually produced. Whereas the item difficulty factor assessed for the manually produced items emerges as better, among those produced with the help of the program there were only 3 too easy items and 0 too difficult ones. In addition, whilst the values obtained for the discriminating power are not as high as we would have desired, the items produced with the help of the program scored much better on that measure and, what is also very important, there was only one item among them with negative discrimination (as opposed to 2 among those constructed manually). Finally, the analysis of the distractors confirms that it is not possible to class the manually produced test items as being of better quality than the ones produced with the help of the program. The test items generated with the help of the program scored better on the number of distractors deemed not useful, were assessed to contain fewer poor distractors and had a higher average difference between students in the lower and upper groups.</Paragraph> <Paragraph position="5"> In order to ensure a more objective assessment of the efficiency of the procedure, we plan to run the following experiment. At least 6 months after a specific set of items has been produced with the help of the program, the post-editors involved will be asked to produce another set manually, based on the same material.</Paragraph> <Paragraph position="6"> Similarly, after such a period, items originally produced manually will be produced by the same post-editors with the help of the program. Such an experiment is expected to eliminate any effect of familiarity and to provide a more objective measure of the extent to which computer-aided construction of tests is more effective than manual production.</Paragraph> <Paragraph position="7"> It should be noted that the post-editors were not professional test developers. It would be interesting to investigate the impact of the program on professional test developers. This is an experiment envisaged as part of our future work.</Paragraph> <Paragraph position="8"> In addition to extending the set of test items to be evaluated and the samples of students taking the test, further work includes experimenting with more sophisticated term extraction techniques and with other, more elaborate models for measuring the semantic similarity of concepts. We would like to test the feasibility of using collocations from an appropriate domain corpus with a view to extending the choice of plausible distractors. We also envisage the development of a more comprehensive grammar for generating questions, which in turn will involve studying and experimenting with existing question generation theories. As our main objective has been to investigate the feasibility of the methodology, we have so far refrained from more advanced NLP processing of the original documents, such as performing anaphora resolution and temporal or spatial reasoning, which would certainly allow for more questions to be generated.</Paragraph>
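One possible, purely illustrative instantiation of such a semantic-similarity model (not necessarily the measure used by the system described here) would rank candidate distractors by their WordNet path similarity to the correct answer. A sketch using NLTK, assuming the WordNet data has been downloaded (e.g. via nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    def rank_distractor_candidates(answer, candidates, top_n=3):
        """Rank candidate distractors by their maximum WordNet path similarity to the
        correct answer: close-but-wrong concepts tend to make plausible distractors."""
        def similarity(a, b):
            scores = [s1.path_similarity(s2) or 0.0
                      for s1 in wn.synsets(a) for s2 in wn.synsets(b)]
            return max(scores, default=0.0)
        return sorted(candidates, key=lambda c: similarity(answer, c), reverse=True)[:top_n]

    # e.g. rank_distractor_candidates("word", ["syllable", "morpheme", "sentence", "banana"])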
<Paragraph position="9"> Future work also envisages an evaluation of the extent to which the questions cover the course material. Finally, even though the agreement between post-editors appears to be a complex issue, we would like to investigate it in more depth. This agreement should be measured on semantic rather than syntactic principles, as the post-editors may produce syntactically different test questions which are semantically equivalent. Similarly, different distractors may be equally good if they are equal in terms of semantic distance to the correct answer.</Paragraph> </Section> </Section> </Paper>