<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2605">
  <Title>Discourse Parsing: Learning FOL Rules based on Rich Verb Semantic Representations to automatically label Rhetorical Relations</Title>
  <Section position="3" start_page="33" end_page="35" type="metho">
    <SectionTitle>
2 Data Collection
</SectionTitle>
    <Paragraph position="0"> The lack of corpora annotated with both rhetorical relations and sentence-level semantic representations led us to create our own corpus. Resources such as (Kingsbury and Palmer, 2002) and (Carlson et al., 2003) have been developed manually. Since such efforts are time-consuming and costly, we decided to semi-automatically build our annotated corpus. We used an existing corpus of instructional text that is about 9MB in size and is made up entirely of written English instructions.</Paragraph>
    <Paragraph position="1"> The two largest components are home repair manuals (5Mb) and cooking recipes (1.7Mb). 3 Segmentation. The segmentation of the corpus was done manually by a human coder. Our segmentation rules are based on those defined in (Mann and Thompson, 1988). For example, (as shown in Example 2) we segment sentences in which a conjunction is used with a clause at the conjunction site.</Paragraph>
    <Paragraph position="2">  (2) You can copy files (//) as well as cut messages. (//) is the segmentation marker. Sentences are segmented into EDUs. Not all the segmentation  rules from (Mann and Thompson, 1988) are imported into our coding scheme. For example, we do not segment relative clauses. In total, our segmentation resulted in 10,084 EDUs. The segmented EDUs were then annotated with rhetorical relations by the human coder4 and also forwarded to the parser as they had to be annotated with semantic information.</Paragraph>
    <Section position="1" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
2.1 Parsing of Verb Semantics
</SectionTitle>
      <Paragraph position="0"> We integrated LCFLEX (Rosé and Lavie, 2000), a robust left-corner parser, with VerbNet (Kipper et al., 2000) and CoreLex (Buitelaar, 1998). Our interest in decompositional theories of lexical semantics led us to base our semantic representation on VerbNet.</Paragraph>
      <Paragraph position="1"> VerbNet operationalizes Levin's work and accounts for 4962 distinct verbs classified into 237 main classes. Moreover, VerbNet's strong syntactic components allow it to be easily coupled with a parser in order to automatically generate a semantically annotated corpus.</Paragraph>
      <Paragraph position="2"> To provide semantics for nouns, we use CoreLex (Buitelaar, 1998), in turn based on the Generative Lexicon (Pustejovsky, 1991). CoreLex defines basic types such as art (artifact) or com (communication). Nouns that share the same bundle of basic types are grouped in the same Systematic Polysemous Class (SPC). The resulting 126 SPCs cover about 40,000 nouns.</Paragraph>
      <Paragraph position="3"> We modified and augmented LCFLEX's existing lexicon to incorporate VerbNet and CoreLex. The lexicon is based on COMLEX (Grishman et al., 1994). Verb and noun entries in the lexicon contain a link to a semantic type defined in the ontology. VerbNet classes (including subclasses and frames) and CoreLex SPCs are realized as types in the ontology. The deep syntactic roles are mapped to the thematic roles, which are defined as variables in the ontology types. For more details on the parser see (Terenzi and Di Eugenio, 2003).</Paragraph>
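      <Paragraph> As an illustration of this coupling, the following Prolog-style sketch shows the general shape of such a lexicon entry; the predicate names (lexical_entry/3, role_map/3), the frame label, and the exact role inventory are assumptions for exposition, not the parser's actual lexicon format:
% A verb entry points to its ontology type (a VerbNet class) and a frame;
% deep syntactic roles are then mapped to that class's thematic roles.
lexical_entry(add, verb_class('MIX-22.12'), frame(np_v_np_pp)).
role_map('MIX-22.12', subject, agent).
role_map('MIX-22.12', object, patient).
role_map('MIX-22.12', pp_object, patient).   % the second PATIENT (e.g. "to the water")</Paragraph>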
      <Paragraph position="4"> Each of the 10,084 EDUs was parsed using the parser. The parser generates both a syntactic tree and the associated semantic representation; for the purpose of this paper, we only focus on the latter. Figure 2 shows the semantic representation generated for EDU 1 from Example 1, "sometimes, you can add a liquid to the water".</Paragraph>
      <Paragraph position="5"> The semantic representation in Figure 2 is part of the F-Structure produced by the parser. The verb add, which belongs to the verb class 'MIX-22.12', is parsed for a transitive frame with a PP modifier. The sentence contains two PATIENTs, namely liquid and water. you is identified as the AGENT by the parser. *TOGETHER and *CAUSE are the primitive semantic predicates used by VerbNet.</Paragraph>
      <Paragraph position="8"> Verb semantics in VerbNet are defined as events that are decomposed into stages, namely start, end, during and result. The semantic representation in Figure 2 states that there is an event EVENT0 in which the two PATIENTs are together at the end.</Paragraph>
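      <Paragraph> Since Figure 2 itself is not reproduced here, the following Prolog-style sketch approximates the representation described above for "sometimes, you can add a liquid to the water"; the exact predicate names and argument order are assumptions, not the parser's literal output:
% Class MIX-22.12: the AGENT causes EVENT0 and, by the *TOGETHER primitive,
% the two PATIENTs are together (physically) at the end of the event.
agent(event0, you).
patient(event0, liquid).
patient(event0, water).
cause(you, event0).                              % *CAUSE
together(end(event0), physical, liquid, water).  % *TOGETHER</Paragraph>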
      <Paragraph position="9"> An independent evaluation of the parser was conducted on a set of 200 sentences from our instructional corpus. It was able to generate complete parses for 72.2% and partial parses for 10.9% of the verb frames that we expected it to parse, given the resources. The parser cannot parse those sentences (or EDUs) that contain a verb not covered by VerbNet. This coverage issue, coupled with parser errors, exacerbates the problem of data sparseness, which is further worsened by the fact that we require both EDUs in a relation set to be parsed for the machine learning part of our work. Addressing data sparseness is left for future work.</Paragraph>
    </Section>
    <Section position="2" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
2.2 Annotation of Rhetorical Relations
</SectionTitle>
      <Paragraph position="0"> The annotation of rhetorical relations was done manually by a human coder. Our coding scheme builds on Relational Discourse Analysis (RDA) (Moser and Moore, 1995), to which we made minor modifications; in turn, as far as discourse relations are concerned, RDA was inspired by Rhetorical Structure Theory (RST) (Mann and Thompson, 1988).</Paragraph>
      <Paragraph position="1"> Rhetorical relations were categorized as informational, elaborational, temporal and others. Informational relations describe how the contents of two relata are related in the domain. These relations are further subdivided into two groups: causality and similarity. The former group consists of relations between an action and other actions or between actions and their conditions or effects; relations like 'act:goal' and 'criterion:act' fall under this group. The latter group consists of relations between two EDUs according to some notion of similarity, such as 'restatement' and 'contrast1:contrast2'. Elaborational relations are interpropositional relations in which one or more propositions provide detail relating to some aspect of another proposition (Mann and Thompson, 1988). Relations like 'general:specific' and 'circumstance:situation' belong to this category.</Paragraph>
      <Paragraph position="2"> Temporal relations like 'before:after' capture time differences between two EDUs. Lastly, the category others includes relations not covered by the previous three categories, such as 'joint' and 'indeterminate'. Based on the modified coding scheme manual, we segmented and annotated our instructional corpus using the augmented RST tool from (Marcu et al., 1999). The RST tool was modified to incorporate our relation set. Since we were only interested in rhetorical relations that hold between two adjacent EDUs (at the moment, we are concerned with learning relations between two EDUs at the base level of a Discourse Parse Tree (DPT) and not at higher levels of the hierarchy), we obtained 3115 sets of potential relations, out of the set of all relations, that we could use as training and testing data.</Paragraph>
      <Paragraph position="3"> The parser was able to provide complete parses for both EDUs in 908 of the 3115 relation sets.</Paragraph>
      <Paragraph position="4"> These constitute the training and test set for Progol. The semantic representation for the EDUs, along with the manually annotated rhetorical relations, was further processed (as shown in Figure 4) and used by Progol as input.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="35" end_page="38" type="metho">
    <SectionTitle>
3 The Inductive Logic Programming Framework
</SectionTitle>
    <Paragraph position="0"> We chose to use Progol, an Inductive Logic Programming (ILP) system, to learn rules based on the data we collected. ILP is an area of research at the intersection of Machine Learning (ML) and Logic Programming. The general problem specification in ILP is given by the following property:</Paragraph>
    <Paragraph position="1"> B ∧ H ⊨ E </Paragraph>
    <Paragraph position="2"> Given the background knowledge B and the examples E, ILP systems find the simplest consistent hypothesis H such that B and H together entail E.</Paragraph>
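    <Paragraph> As a toy illustration of this property (the predicates and constants here are invented for exposition and unrelated to our corpus): given background knowledge B and an induced hypothesis H, the positive examples E must follow from B together with H.
% B: background knowledge.
bird(tweety).
bird(polly).
% H: a hypothesis induced from the positive examples.
flies(X) :- bird(X).
% E: the positive examples flies(tweety) and flies(polly) are entailed
% by B together with H (e.g. the query ?- flies(tweety). succeeds).</Paragraph>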
    <Paragraph position="3"> While most of the work in NLP that involves learning has used more traditional ML paradigms like decision-tree algorithms and SVMs, we did not find them suitable for our data which is represented as Horn clauses. The requirement of using a ML system that could handle first order logic data led us to explore ILP based systems of which we found Progol most appropriate.</Paragraph>
    <Paragraph position="4"> Progol combines Inverse Entailment with general-to-specific search through a refinement graph. A most specific clause is derived using mode declarations along with Inverse Entailment.</Paragraph>
    <Paragraph position="5"> All clauses that subsume the most specific clause form the hypothesis space. An A*-like search is then performed through this hypothesis space for the most probable theory. Progol allows arbitrary programs as background knowledge and arbitrary definite clauses as examples.</Paragraph>
    <Section position="1" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
3.1 Learning from positive data only
</SectionTitle>
      <Paragraph position="0"> One of the features we found appealing about Progol, besides being able to handle first order logic data, is that it can learn from positive examples alone.</Paragraph>
      <Paragraph position="1"> Human learning of natural language is a universal process based on positive data only. Traditional learning models, however, do not work well without negative examples, and negative examples are not easy to obtain. Moreover, we found learning from positive data only to be a natural way to model the task of discourse parsing.</Paragraph>
      <Paragraph position="2"> To make learning from positive data only feasible, Progol uses a Bayesian framework, within which it learns logic programs with an arbitrarily low expected error using only positive data. Of course, we could have synthetically labeled examples of relation sets (pairs of EDUs) that do not belong to a particular relation as negative examples; we plan to explore this approach in the future.</Paragraph>
      <Paragraph position="3"> A key issue in learning from positive data only in a Bayesian framework is the ability to learn complex logic programs. Without any negative examples, the simplest rule or logic program, which in our case would be a single definite clause, would be assigned the highest score, as it covers the largest number of examples. To handle this problem, Progol's scoring function trades off the size of the hypothesis against its generality. The score for a given hypothesis is calculated according to formula 4.</Paragraph>
      <Paragraph position="4"> \[ \ln p(H|E) = m \ln\frac{1}{g(H)} - sz(H) + d_m \qquad (4) \] </Paragraph>
      <Paragraph position="5"> sz(H) and g(H) compute the size of the hypothesis and its generality, respectively. The size of a hypothesis is measured as the number of atoms in it, whereas generality is measured by the number of positive examples the hypothesis covers. m is the number of examples covered by the hypothesis and dm is a normalizing constant. The function ln p(H|E) decreases as sz(H) and g(H) increase. As the number of covered examples m grows, the requirements on g(H) become even stricter. This property facilitates learning more complex rules, as they are supported by more positive examples.</Paragraph>
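      <Paragraph> To see why a larger m makes the requirements on g(H) stricter (assuming the form of formula 4 as reconstructed above, with the generality term m ln(1/g(H))): differentiating that term with respect to g gives
\[ \frac{\partial}{\partial g}\left[ m \ln\frac{1}{g(H)} \right] = -\frac{m}{g(H)}, \]
so the score lost per unit increase in generality grows linearly with the number of covered examples; a hypothesis covering many examples can therefore afford additional atoms (a larger sz(H)) only if they yield a sufficiently large reduction in g(H).</Paragraph>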
      <Paragraph position="6"> For more information on Progol and the computation of Bayes' posterior estimation, please refer to (Muggleton, 1995).</Paragraph>
    </Section>
    <Section position="2" start_page="36" end_page="38" type="sub_section">
      <SectionTitle>
3.2 Discourse Parsing with Progol
</SectionTitle>
      <Paragraph position="0"> We model the problem of assigning the correct rhetorical relation as a classification task within the ILP framework. The rich verb semantic representations of pairs of EDUs, as shown in Figure 3, form the background knowledge, and the manually annotated rhetorical relations between the pairs of EDUs, as shown in Figure 4, serve as the positive examples in our learning framework. The numbers in the definite clauses are ids used to identify the EDUs.</Paragraph>
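      <Paragraph> Figures 3 and 4 are not reproduced here, so the following Prolog-style sketch only illustrates the general shape of that input; the EDU ids (101, 102), the argument layout of the semantic predicates, and the relation constant are hypothetical:
% Background knowledge (Figure 3 style): VerbNet-derived semantic predicates
% for a pair of EDUs, each clause carrying the id of the EDU it describes.
has_possession(101, event0, start, agent0, theme0).
transfer(101, event0, during, theme0).
has_possession(101, event0, end, recipient0, theme0).
visible(102, event1, end, theme0).
% Positive example (Figure 4 style): the manually annotated relation
% between the two EDUs of the pair.
relation(101, 102, goal_act).</Paragraph>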
      <Paragraph position="1"> Progol constructs logic programs based on the background knowledge and the examples in Figures 3 and 4. Mode declarations in the Progol input file determine which clause is to be used as the head (i.e. modeh) and which ones are to be used in the body (i.e. modeb) of the hypotheses. Figure 5 shows an abridged set of our mode declarations.</Paragraph>
      <Paragraph position="2"> Our mode declarations dictate that the predicate relation be used as the head and that the other predicates (has_possession, transfer and visible) form the body of the hypotheses. '*' indicates that the number of hypotheses to learn for a given relation is unlimited. The '+' and '-' signs mark variables within the predicates, the former an input variable and the latter an output variable. '#' denotes a constant. Each argument of a predicate is a type, whether a constant or a variable. Types are defined as a single definite clause. Our goal is to learn rules where the LHS of the rule contains the relation that we wish to learn and the RHS is a CNF of the semantic predicates defined in VerbNet with their arguments.</Paragraph>
      <Paragraph position="3"> [Figure 5, abridged mode declarations (as recovered): :- modeh(*,relation(+edu,+edu,#relationtype))? :- modeb(*,has_possession(+edu,#event, ...] </Paragraph>
      <Paragraph position="4"> Given the amount of training data we have, the nature of the data itself and the Bayesian framework used, Progol learns simple rules that contain just one or two clauses on the RHS. Six of the 68 rules that Progol manages to learn are shown in Figure 6. RULE4 states that there is a theme in motion during the event in EDU A (the first EDU) and that the theme is located in location D at the start of the event in EDU B (the second EDU). RULE2 is learned from pairs of EDUs such as the one in Example 1. The simple rules in Figure 6 may not readily appeal to our intuitive notion of what such rules should include, and it is not clear at this point how elaborate these rules need to be in order to correctly identify the relation in question. One reason why Progol does not learn more complex rules is that there are not enough training examples. As we add more training data in the future, we will see whether rules more elaborate than the ones in Figure 6 are learned.</Paragraph>
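      <Paragraph> Since Figure 6 is not reproduced here, the following Prolog-style sketch shows what a rule of the RULE4 kind, as paraphrased above, might look like; the relation constant and the exact predicate signatures are assumptions for illustration only:
% Hypothetical RULE4-style rule: a theme is in motion during the event in
% EDU A, and the same theme is at location D at the start of the event in EDU B.
relation(A, B, before_after) :-
    motion(A, _EventA, during, Theme),
    location(B, _EventB, start, Theme, _D).</Paragraph>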
      <Paragraph position="5"> 4 Evaluation of the Discourse Parser </Paragraph>
      <Paragraph position="6"> Table 1 shows the sets of relations for which we managed to obtain semantic representations for both the EDUs. [Table 1: number of examples per relation that yielded semantic representations for both EDUs, i.e. examples that could potentially be used.]</Paragraph>
      <Paragraph position="7"> For a number of relations, the total number of examples we could use was less than 50. For the time being, we decided to use only those relation sets that had more than 50 examples. In addition, we chose not to use the Joint and General:specific relations; they will be included in the future. Hence, our training and testing data consisted of the following four relations: Goal:act, Step1:step2, Criterion:act and Before:after. The total number of examples we used was 508, of which 423 were used for training and 85 for testing.</Paragraph>
      <Paragraph position="8"> Table 2, Table 3 and Table 4 show the results from running the system on our test data. A total of 85 positive examples were used for testing the system.</Paragraph>
      <Paragraph position="9"> Table 2 evaluates our SemDP system against a baseline. Our baseline is the majority function, which performs at a 51.7 F-Score. SemDP outperforms the baseline by about 8.5 points, with an F-Score of 60.24. To the best of our knowledge, there is no other work that uses rich semantic information for discourse parsing. (Polanyi et al., 2004) do not provide any evaluation results at all. (Soricut and Marcu, 2003) report that their SynDP parser achieved up to a 63.8 F-Score on human-segmented test data. Our result of a 60.24 F-Score shows that a Discourse Parser based purely on semantics can perform as well. However, since the corpus, the size of the training data and the set of rhetorical relations we have used differ from (Soricut and Marcu, 2003), a direct comparison cannot be made.</Paragraph>
      <Paragraph position="10"> Table 3 breaks down the results in detail for each of the four rhetorical relations we tested on. Since we are learning from positive data only and the rules we learn depend heavily on the amount of training data we have, we expected the system to be more accurate for the relations that have more training examples. As expected, SemDP did very well in labeling Step1:step2 relations. Surprisingly though, it did not perform as well on Goal:act, even though it had the second highest number of training examples (147 in total). In fact, SemDP misclassified more positive test examples for Goal:act than for Before:after or Criterion:act, relations which had almost one third the number of training examples. Overall, SemDP achieved a precision of 61.7 and a recall of 58.8. In order to find out how the positive test examples were misclassified, we investigated the distribution of the relations classified by SemDP. Table 4 is the confusion matrix that highlights this issue. A majority of the actual Goal:act relations are incorrectly classified as Step1:step2 and Before:after. Likewise, most of the misclassifications of actual Step1:step2 relations are labeled as Goal:act or Before:after. Such misclassification occurs because the simple rules learned by SemDP are not able to accurately distinguish cases where positive examples of two different relations share similar semantic predicates. Moreover, since we are learning using positive examples only, it is possible that a positive example may satisfy two or more rules for different relations. In such cases, the rule that has the highest score (as calculated by formula 4) is used to label the unseen example.</Paragraph>
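      <Paragraph> As a consistency check on these overall figures (a worked computation, not reported in the paper): with precision P = 61.7 and recall R = 58.8, the balanced F-Score is
\[ F = \frac{2PR}{P+R} = \frac{2 \times 61.7 \times 58.8}{61.7 + 58.8} \approx 60.2, \]
which agrees with the reported F-Score of 60.24 up to rounding of P and R.</Paragraph>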
    </Section>
  </Section>
</Paper>