<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1108">
  <Title>Event Extraction in a Plot Advice Agent</Title>
  <Section position="5" start_page="857" end_page="858" type="metho">
    <SectionTitle>
3 Plot Analysis
</SectionTitle>
    <Paragraph position="0"> To automatically rate student writing many tutoring systems use Latent Semantic Analysis, a variation on the &amp;quot;bag-of-words&amp;quot; technique that uses dimensionality reduction (Graesser et al., 2000).</Paragraph>
    <Paragraph position="1"> We hypothesize that better results can be achieved using a &amp;quot;representational&amp;quot; account that explicitly represents each event in the plot. These semantic relationships are important in stories, e.g., &amp;quot;The thief jumped on the donkey&amp;quot; being distinctly different from &amp;quot;The donkey jumped on the thief.&amp;quot; What characters participate in an action matter, since &amp;quot;The king stole the treasure&amp;quot; reveals a major  misunderstanding while &amp;quot;The thief stole the treasure&amp;quot; shows a correct interpretation by the student.</Paragraph>
    <Section position="1" start_page="858" end_page="858" type="sub_section">
      <SectionTitle>
3.1 Stories as Events
</SectionTitle>
      <Paragraph position="0"> We represent a story as a sequence of events, p1...ph, represented as a list of predicatearguments, similar to the event calculus (Mueller, 2003). Our predicate-argument structure is a minimal subset of first-order logic (no quantifiers), and so is compatible with case-frame and dependency representations. Every event has a predicate (function) p that has one or more arguments, n1...na.</Paragraph>
      <Paragraph position="1"> In the tradition of Discourse Representation Theory (Kamp and Reyle, 1993), our current predicate argument structure could be converted automatically to first order logic by using a default existential quantification over the predicates and joining them conjunctively. Predicate names are often verbs, while their arguments are usually, although not exclusively, nouns or adjectives. When describing a set of events in the story, a superscript is used to keep the arguments in an event distinct, as n25 is argument 2 in event 5. The same argument name may appear in multiple events. The plot of any given story is formalized as an event structure composed of h events in a partial order, with the partial order denoting their temporal order: p1(n11,n21,...na1),....,ph(n2h,n4h...nch) An example from the &amp;quot;Thief&amp;quot; exemplar story is &amp;quot;The Queen nagged the king to build a treasure chamber. The king decided to have a treasure chamber.&amp;quot; This can be represented by an event structure as:</Paragraph>
      <Paragraph position="3"> Note due the ungrammatical corpus we cannot at this time extract neo-Davidsonian events. A sentence maps onto one, multiple, or no events. A unique name and closed-world assumption is enforced, although for purposes of comparing event we compare membership of argument and predicate names in WordNet synsets in addition to exact name matches (Fellbaum, 1998).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="858" end_page="859" type="metho">
    <SectionTitle>
4 Extracting Events
</SectionTitle>
    <Paragraph position="0"> Paralleling work in summarization, it is hypothesized that the quality of a rewritten story can be defined by the presence or absence of &amp;quot;semantic content units&amp;quot; that are crucial details of the text that may have a variety of syntactic forms (Nenkova and Passonneau, 2004). We further hypothesize these can be found in chunks of the text automatically identified by a chunker, and we can represent these units as predicate-arguments in our event structure. The event structure of each story is automatically extracted using an XML-based pipeline composed of NLP processing modules, and unlike other story systems, extract full events instead of filling in a frame of a story script (Riloff, 1999). Using the latest version of the Language Technology Text Tokenization Toolkit (Grover et al., 2000), words are tokenized and sentence boundaries detected. Words are given part-of-speech tags by a maximum entropy tagger from the toolkit. We do not attempt to obtain a full parse of the sentence due to the highly irregular nature of the sentences. Pronouns are resolved using a rule-based reimplementation of the CogNIAC algorithm (Baldwin, 1997) and sentences are lemmatized and chunked using the Cass Chunker (Abney, 1995). It was felt the chunking method would be the only feasible way to retrieve portions of the sentences that may contain complete &amp;quot;semantic content units&amp;quot; from the ungrammatical and irregular text. The application of a series of rules, mainly mapping verbs to predicate names and nouns to arguments, to the results of the chunker produces events from chunks as described in our previous work (McNeill et al., 2006). The accuracy of our rule-set was developed by using the grammatical exemplar stories as a testbed, and a blind judge found they produced 68% interpretable or &amp;quot;sensible&amp;quot; events given the ungrammatical text. Students usually use the present or past tense exclusively throughout the story and events are usually presented in order of occurrence. An inspection of our corpus showed 3% of stories in our corpus seemed to get the order of events wrong (Hickmann, 2003).</Paragraph>
    <Section position="1" start_page="858" end_page="859" type="sub_section">
      <SectionTitle>
4.1 Comparing Stories
</SectionTitle>
      <Paragraph position="0"> Since the student is rewriting the story using their own words, a certain variance from the plot of the exemplar story should be expected and even rewarded. Extra statements that may be true, but are not explicitly stated in the story, can be inferred by the students. Statements that are true but are not highly relevant to the course of the  plot can likewise be left out. Word similarity must be taken into account, so that &amp;quot;The king is protecting his gold&amp;quot; can be recognized as &amp;quot;The pharaoh guarded the treasure.&amp;quot; Characters change in context, as one character that is described as the &amp;quot;younger brother&amp;quot; is from the viewpoint of his mother &amp;quot;the younger son.&amp;quot; So, building a model from the events of two stories and simply checking equivalence can not be used for comparison, since a wide variety of partial equivalence must be taken into account.</Paragraph>
      <Paragraph position="1"> Instead of using absolute measures of equivalence based on model checking or measures based on word distribution, we compare each story on the basis of the presence or absence of events. This approach takes advantage of WordNet to define synonym matching and uses the relational structure of the events to allow partial matching of predicate functions and arguments. The events of the exemplar story are assumed to be correct, and they are searched for in the rewritten story in the order in which they occur in the exemplar. If an event is matched (including using WordNet), then in turn each of the arguments attempts to be matched.</Paragraph>
      <Paragraph position="2"> This algorithm is given more formally in Figure 1. The complete event structure from the exemplar story, E, and the complete event structure from the rewritten story R, with each individual event predicate name labelled as e and r respectively, and their arguments labelled as n in either Ne and Nr. SYN(x) is the synset of the term x, including hypernyms and hyponyms except upper ontology ones. The results of the algorithm are stored in binary vector F with index i. 1 denotes an exact match or WordNet synset match, and 0 a failure to find any match.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="859" end_page="860" type="metho">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> As a baseline system LSA produces a similarity score for each rewritten story by comparing it to the exemplar, this score is used as a distance metric for a k-Nearest Neighbor classifier (Deerwester et al., 1990). The parameters for LSA were empirically determined to be a dimensionality of 200 over the semantic space given by the recommended reading list for American 6th graders (Landauer and Dumais, 1997). These parameters resulted in the LSA similarity score having a Pearson's correlation of -.520 with Rater A. k was found to be optimal at 9.</Paragraph>
    <Paragraph position="1">  The results of the plot comparison algorithm were given as features to machine-learners, with results produced using ten-fold cross-validation. A Naive Bayes learner discovers the different statistical distributions of events for each rating. The results for both the &amp;quot;Adventure&amp;quot; and &amp;quot;Thief&amp;quot; stories are displayed in Table 2. &amp;quot;PLOT&amp;quot; means the results of the Plot Comparison Algorithm were used as features for the machine-learner while &amp;quot;LSA&amp;quot; means the similarity scores for Latent Semantic Analysis were used instead. Note that the same machine-learner could not be used to judge the effect of LSA and PLOT since LSA scores are real numbers and PLOT a set of features encoded as binary vectors.</Paragraph>
    <Paragraph position="2"> The results do not seem remarkable at first glance. However, recall that the human raters had an average of 56% agreement on story ratings, and in that light the Naive Bayes learner approaches the performance of human raters. Surprisingly, when the LSA score is used as a feature in addition to the results of the plot comparison algorithm for the Naive Bayes learners, there is no further improvement. This shows features given by the event  structure better characterize plot structure than the word distribution. Unlike previous work, the use of both the plot comparison results and LSA did not improve performance for Naive Bayes, so the results of using Naive Bayes with both are not reported (Halpin et al., 2004).</Paragraph>
    <Paragraph position="3"> The results for the &amp;quot;Adventure&amp;quot; corpus are in general better than the results for the &amp;quot;Thief&amp;quot; corpus. However, this is due to the &amp;quot;Thief&amp;quot; corpus being smaller and having an infrequent number of &amp;quot;Excellent&amp;quot; and &amp;quot;Poor&amp;quot; stories, as shown in Table 1. In the &amp;quot;Thief&amp;quot; corpus the learner simply collapses most stories into &amp;quot;Good,&amp;quot; resulting in very poor performance. Another factor may be that the &amp;quot;Thief&amp;quot; story was more complex than the &amp;quot;Adventure&amp;quot; story, featuring 9 characters over 5 scenes, as opposed to the &amp;quot;Adventure&amp;quot; corpus that featured 4 characters over 2 scenes.</Paragraph>
    <Paragraph position="4"> For the &amp;quot;Adventure&amp;quot; corpus, the Naive Bayes classifier produces the best results, as detailed in Table 4 and the confusion matrix in Figure 3. A close inspection of the results shows that in the &amp;quot;Adventure Corpus&amp;quot; the &amp;quot;Poor&amp;quot; and &amp;quot;Good&amp;quot; stories are classified in general fairly well by the Naive Bayes learner, while some of the &amp;quot;Excellent&amp;quot; stories are classified as correctly. A significant number of both &amp;quot;Excellent&amp;quot; and most &amp;quot;Fair&amp;quot; stories are classified as &amp;quot;Good.&amp;quot; The &amp;quot;Fair&amp;quot; category, due to its small size in the training corpus, has disappeared. No &amp;quot;Poor&amp;quot; stories are classified as &amp;quot;Excellent,&amp;quot; and no &amp;quot;Excellent&amp;quot; stories are classified as &amp;quot;Poor.&amp;quot; The increased difficulty in distinguishing &amp;quot;Excellent&amp;quot; stories from &amp;quot;Good&amp;quot; stories is likely due to the use of inference by &amp;quot;Excellent&amp;quot; stories, which our system does not use. An inspection of the rating scale's wording reveals the similarity in wording between the &amp;quot;Fair&amp;quot; and &amp;quot;Good&amp;quot; ratings. This may explain the lack of &amp;quot;Fair&amp;quot; stories in the corpus and therefore the inability of machine-learners to recognize them. As given by a survey of five teachers experienced in using the story rewriting task in schools, this level of performance is not ideal but acceptable to teachers.</Paragraph>
    <Paragraph position="5"> Our technique is also shown to be easily portable over different domains where a teacher can annotate around one hundred sample stories using our scale, although performance seems to suffer the more complex a story is. Since the Naive Bayes classifier is fast (able to classify stories in only a few seconds) and the entire algorithm from training to advice generation (as detailed below) is fully automatic once a small training corpus has been produced, this technique can be used in real-life tutoring systems and easily ported to other stories. null</Paragraph>
  </Section>
  <Section position="8" start_page="860" end_page="862" type="metho">
    <SectionTitle>
5 Automated Advice
</SectionTitle>
    <Paragraph position="0"> The plot analysis agent is not meant to give the students grades for their stories, but instead use the automatic ratings as an intermediate step to produce advice, like other hybrid tutoring systems (Rose et al., 2002). The advice that the agent can generate from the automatic rating classification is limited to coarse-grained general advice. However, by inspecting the results of the plot comparison algorithm, our agent is capable of giving detailed fine-grained specific advice from the relationships of the events in the story. One tutoring system resembling ours is the WRITE system, but we differ from it by using event structure to represent the information in the system, instead of using rhetorical features (Burstein et al., 2003). In this regards it more closely resembles the physics tutoring system WHY-ATLAS, although we deal with narrative stories of a longer length than physics essays. The WHY-ATLAS physics tutor identifies missing information in the explanations of students using theorem-proving (Rose et al., 2002).</Paragraph>
    <Section position="1" start_page="860" end_page="861" type="sub_section">
      <SectionTitle>
5.1 Advice Generation Algorithm
</SectionTitle>
      <Paragraph position="0"> Different types of stories need different amounts of advice. An &amp;quot;Excellent&amp;quot; story needs less advice than a &amp;quot;Good&amp;quot; story. One advice statement is &amp;quot;general,&amp;quot; while the rest are specific. The system  produces a total of seven advice statements for a &amp;quot;Poor&amp;quot; story, and two less statements for each rating level above &amp;quot;Poor.&amp;quot; With the aid of a teacher, a number of &amp;quot;canned&amp;quot; text statements offering general advice were created for each rating class. These include statements such as &amp;quot;It's very good! I only have a few pointers&amp;quot; for a &amp;quot;Good&amp;quot; story and &amp;quot;Let's get help from the teacher&amp;quot; for &amp;quot;Poor&amp;quot; story. The advice generation begins by randomly selecting a statement suitable for the rating of the story. Those students whose stories are rated &amp;quot;Poor&amp;quot; are asked if they would like to re-read the story and ask a teacher for help.</Paragraph>
      <Paragraph position="1"> The generation of specific advice uses the results of the plot-comparison algorithm to produce specific advice. A number of advice templates were produced, and the results of the Advice Generation Algorithm fill in the needed values of the template. The ph most frequent events in &amp;quot;Excellent&amp;quot; stories are called the Important Event Structure, which represents the &amp;quot;important&amp;quot; events in the story in temporal order. Empirical experiments led us ph = 10 for the &amp;quot;Adventure&amp;quot; story, but for longer stories like the &amp;quot;Thief&amp;quot; story a larger ph would be appropriate. These events correspond to the ones given the highest weights by the Naive Bayes algorithm. For each event in the event structure of a rewritten story, a search for a match in the important event structure is taken. If a predicate name match is found in the important event structure, the search continues to attempt to match the arguments. If the event and the arguments do not match, advice is generated using the structure of the &amp;quot;important&amp;quot; event that it cannot find in the rewritten story.</Paragraph>
      <Paragraph position="2"> This advice may use both the predicate name and its arguments, such as &amp;quot;Did the stork fly?&amp;quot; from fly(stork). If an argument is missing, the advice may be about only the argument(s), like &amp;quot;Can you tell me more about the stork?&amp;quot; If the event is out of order, advice is given to the student to correct the order, as in &amp;quot;I think something with the stork happened earlier in the story.&amp;quot; This algorithm is formalized in Figure 2, with all variables being the same as in the Plot Analysis Algorithm, except that W is the Important Event Structure composed of events w with the set of arguments Nw. M is a binary vector used to store the success of a match with index i. The ADV function, given an event, generates one ad- null vice statement to be given to the student.</Paragraph>
      <Paragraph position="3"> An element of randomization was used to generate a diversity of types of answers. An advice generation function (ADV ) takes an important event (w) and its binary matching vector (M) and generates an advice statement for w. Per important event this advice generation function is parameterized so that it has a 10% chance of delivering advice based on the entire event, 20% chance of producing advice that dealt with temporal order (these being parameters being found ideal after testing the algorithm), and otherwise produces advice based on the arguments.</Paragraph>
    </Section>
    <Section position="2" start_page="861" end_page="861" type="sub_section">
      <SectionTitle>
5.2 Advice Evaluation
</SectionTitle>
      <Paragraph position="0"> The plot advice algorithm is run using a randomly selected corpus of 20 stories, 5 from each plot rating level using the &amp;quot;Adventure Corpus.&amp;quot; This produced matching advice for each story, for a total of 80 advice statements.</Paragraph>
    </Section>
    <Section position="3" start_page="861" end_page="862" type="sub_section">
      <SectionTitle>
5.3 Advice Rating
</SectionTitle>
      <Paragraph position="0"> An advice rating scheme was developed to rate the  advice produced in consultation with a teacher. 1. Excellent: The advice was suitable for the story, and helped the student gain insight into the story.</Paragraph>
      <Paragraph position="1"> 2. Good: The advice was suitable for the story,  and would help the student.</Paragraph>
      <Paragraph position="2"> 3. Fair: The advice was suitable, but should have been phrased differently.</Paragraph>
      <Paragraph position="3"> 4. Poor: The advice really didn't make sense  and would only confuse the student further. Before testing the system on students, it was decided to have teachers evaluate how well the advice given by the system corresponded to the advice they would give in response to a story. A teacher read each story and the advice. They then rated the advice using the advice rating scheme. Each story was rated for its overall advice quality, and then each advice statement was given comments by the teacher, such that we could derive how each individual piece of advice contributed to the global rating. Some of the general &amp;quot;coarse-grained&amp;quot; advice was &amp;quot;Good! You got all the main parts of the story&amp;quot; for an &amp;quot;Excellent&amp;quot; story, &amp;quot;Let's make it even better!&amp;quot; for a &amp;quot;Good&amp;quot; story, and &amp;quot;Reading the story again with a teacher would be help!&amp;quot; for a &amp;quot;Poor&amp;quot; story. Sometimes the advice generation algorithm was remarkably accurate. In one story the connection between a curse being lifted by the possession of a coin by the character Nils was left out by a student. The advice generation algorithm produced the following useful advice statement: &amp;quot;Tell me more about the curse and Nils.&amp;quot; Occasionally an automatically extracted event that is difficult to interpret by a human or simply incorrectly is extracted. This in turn can cause advice that does not make any sense can be produced, such as &amp;quot;Tell me more about a spot?&amp;quot;. Qualitative analysis showed that &amp;quot;missing important advice&amp;quot; to be the most significant problem, followed by &amp;quot;nonsensical advice.&amp;quot;</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML