<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2020"> <Title>Learning Information Structure in The Prague Treebank</Title> <Section position="3" start_page="115" end_page="115" type="metho"> <SectionTitle> 2 Prague Dependency Treebank </SectionTitle> <Paragraph position="0"> The Prague Dependency Treebank (PDT) consists of newspaper articles from the Czech National Corpus</Paragraph> <Paragraph position="2"> (Čermák, 1997) and includes three layers of annotation. The morphological layer gives a full morphemic analysis in which 13 categories are marked for all sentence tokens (including punctuation marks). The analytical layer, on which the &quot;surface&quot; syntax (Hajič, 1998) is annotated, contains analytical tree structures, in which every token from the surface shape of the sentence has a corresponding node labeled with main syntactic functions like SUBJ, PRED, OBJ, ADV. The tectogrammatical layer renders the deep (underlying) structure of the sentence (Sgall et al., 1986; Hajičová et al., 1998).</Paragraph> <Paragraph position="3"> Tectogrammatical tree structures (TGTSs) contain nodes corresponding only to the autosemantic words of the sentence (e.g., no preposition nodes) and to deletions on the surface level; the condition of projectivity is obeyed, i.e., no crossing edges are allowed; each node of the tree is assigned a functor such as ACTOR, PATIENT, ADDRESSEE, ORIGIN, EFFECT, the list of which is very rich; elementary coreference links are indicated for pronouns.</Paragraph> </Section> <Section position="4" start_page="115" end_page="116" type="metho"> <SectionTitle> 3 Topic Focus Articulation (TFA) </SectionTitle> <Paragraph position="0"> The tectogrammatical level of the PDT was motivated by the increasingly obvious need for large corpora that treat not only the morphological and syntactic structure of the sentence but also semantic and discourse-related phenomena. 
Thus, TGTSs have been enriched with features displaying the information structure of the sentence, which is a means of showing its contextual potential.</Paragraph> <Section position="1" start_page="115" end_page="116" type="sub_section"> <SectionTitle> 3.1 Theory </SectionTitle> <Paragraph position="0"> In the Praguian approach to information structure (IS), the content of the sentence is divided into two parts: the Topic is &quot;what the sentence is about&quot; and the Focus represents the information asserted about the Topic. A prototypical declarative sentence asserts that its Focus holds (or does not hold) about its Topic: Focus(Topic) or notFocus(Topic). The TFA definition uses the distinction between Context-Bound (CB) and Non-Bound (NB) parts of the sentence. To distinguish which items are CB and which are NB, the question test is applied (i.e., the question for which a given sentence is the appropriate answer is considered). In this framework, weak and zero pronouns and those items in the answer which reproduce expressions present (or associated with those present) in the question are CB. Other items are NB.</Paragraph> <Paragraph position="1"> In example (1), (b) is the sentence under investigation, in which CB and NB items are marked, (a) is the context in which the sentence is uttered, and (c) is the question for which the given sentence is an appropriate answer: (1) (a) Tom and Mary both came to John's party.</Paragraph> <Paragraph position="2"> (b) John 1. The main verb and any of its direct dependents belong to the Focus if they are NB; 2. Every item that does not depend directly on the main verb and is subordinated to an element of Focus belongs to Focus (where &quot;subordinated to&quot; is defined as the irreflexive transitive closure of &quot;depend on&quot;); 3. 
If the main verb and all its dependents are CB, then those dependents k_i of the verb which have subordinated items l_m that are NB are called 'proxy foci'; the items l_m together with all items subordinated to them belong to Focus, where i, m > 1; 4. Every item not belonging to Focus according to 1-3 belongs to Topic.</Paragraph> </Section> <Section position="2" start_page="116" end_page="116" type="sub_section"> <SectionTitle> 3.2 Annotation guidelines </SectionTitle> <Paragraph position="0"> Within the PDT, the TFA attribute has been annotated for all nodes (including the restored ones) at the tectogrammatical level. Instructions for assigning the TFA attribute are specified in (Buráňová et al., 2000) and are summarized in Table 1. These instructions are based on the surface word order, the position of the sentence stress (intonation center -</Paragraph> </Section> </Section> <Section position="5" start_page="116" end_page="116" type="metho"> <SectionTitle> IC </SectionTitle> <Paragraph position="0"> ) and the canonical order of the dependents.</Paragraph> <Paragraph position="1"> The TFA attribute has 3 values: t, for non-contrastive CB items; f, for NB items; and c, for contrastive CB items. In this paper, we do not distinguish between contrastive and non-contrastive items, treating both of them as t. In the PDT annotation, the values t (from topic) and f (from focus) were chosen because, in most cases, in prototypical sentences, t items belong to the Topic and f items to the Focus.</Paragraph> <Paragraph position="2"> Before the manual annotation, the corpus was preprocessed to mark all nodes with the TFA value f, as it is the more common value. 
Then the annotators changed the value according to the guidelines in Table 1.</Paragraph> </Section> <Section position="6" start_page="116" end_page="119" type="metho"> <SectionTitle> 4 Automatic extraction of TFA </SectionTitle> <Paragraph position="0"> In this section we consider the automatic identification of t and f using machine learning techniques trained on the annotated data.</Paragraph> <Paragraph position="1"> The data set consists of 1053 files (970,920 words) from the pre-released version of PDT 2.0.</Paragraph> <Paragraph position="2"> We restrict our experiments by considering only noun and pronoun nodes. The total number of instances (nouns and pronouns) in the data is 297,220, of which 254,242 (86.54%) are nouns and 39,978 (13.46%) are pronouns. The t/f distribution of these instances is 172,523 f (58.05%) and 124,697 t (41.95%).</Paragraph> <Paragraph position="3"> We experimented with three different classifiers, C4.5, Bagging and Ripper, because they are based on different machine learning techniques (decision trees, bagging, rule induction) and we wanted to see which of them performs best on this task. We used Weka implementations of these classifiers (Witten and Frank, 2000).</Paragraph>
<Paragraph position="4"> In the PDT the intonation center is not annotated. However, the annotators were instructed to use their own judgment about where the IC falls when they utter the sentence.</Paragraph> <Paragraph position="5"> We are grateful to our colleagues at the Charles University in Prague for providing us with the experimental data before the PDT 2.0 official release.</Paragraph> <Section position="1" start_page="116" end_page="117" type="sub_section"> <SectionTitle> 4.1 Features </SectionTitle> <Paragraph position="0"> The experiments use two types of features: (1) basic features of the nodes taken directly from the treebank (node attributes), and (2) derived features inspired by the annotation guidelines.</Paragraph> <Paragraph position="1"> The basic features are the following (the first 4 are boolean, and 5 and 6 are nominal): 1. is-noun: true, if the node is a noun; 2. is-root: true, if the node is the root of the tree; 3. is-coref-pronoun: true, if the node is a coreferential pronoun; 4. is-noncoref-pronoun: true, if the node is a non-coreferential pronoun (in Czech, many pronouns are used in idiomatic expressions in which they do not have a coreferential function, e.g., svého času, lit. 'in its (reflexive) time', 'some time ago'); 5. SUBPOS: detailed part of speech which differentiates between types of pronouns: personal, demonstrative, relative, etc.; 6. functor: type of dependency relation: MOD, MANN, ATT, OTHER.</Paragraph> <Paragraph position="2"> The derived features are computed using the dependency information from the tectogrammatical level of the treebank and the surface order of the words corresponding to the nodes. We have also used lists of forms of Czech pronouns that are used as weak pronouns, indexical expressions, pronouns with general meaning, or strong pronouns. All the derived features have boolean values: 7. is-rightmost-dependent-of-the-verb; 8. is-rightside-dependent-of-the-verb; 9. 
is-leftside-dependent; 10. is-embedded-attribute: true, if the node's parent is not the root; 11. has-repeated-lemma: true, in the case of nouns, when another node with the same lemma appears in the previous 10 sentences.</Paragraph> <Paragraph position="3"> 12. is-in-canonical-order; 13. is-weak-pronoun; 14. is-indexical-expression; 15. is-pronoun-with-general-meaning; 16. is-strong-pronoun-with-no-prep. On the tectogrammatical level of the PDT, the order of the nodes was changed during the annotation of the TFA attribute, so that all t items precede all f items. Our features use the surface order of the words corresponding to the nodes.</Paragraph> <Paragraph position="4"> 1. The bearer of the IC (typically, the rightmost child of the verb) → f; 2. If IC is not on the rightmost child, everything after IC → t; 3. A left-side child of the verb (unless it carries IC) → t; 4. The verb and the right children of the verb before the f-node (cf. 1) that are canonically ordered → f; 5. Embedded attributes (unless repeated or restored) → f; 6. Restored nodes → t; 7. Indexical expressions (já 'I', ty 'you', teď 'now', tady 'here'), weak pronouns, pronominal expressions with a general meaning (někdo 'somebody', jednou 'once') (unless they carry IC) → t; 8. Strong forms of pronouns not preceded by a preposition (unless they carry IC) → t.</Paragraph> </Section> <Section position="2" start_page="117" end_page="117" type="sub_section"> <SectionTitle> 4.2 Evaluation framework </SectionTitle> <Paragraph position="0"> In order to perform the evaluation, we randomly selected 101,054 instances (1/3 of the data) from all the instances as our test set; the remaining 2/3 of the data we used as a training set.</Paragraph> <Paragraph position="1"> The same test set is used by all three classifiers. 
In our experiments we have not tweaked the features and thus we have not set aside a development set.</Paragraph> <Paragraph position="2"> In the test set 87% of the instances are nouns and 13% are pronouns. The t/f distribution in the test set is as follows: 58% of the instances are t, and 42% are f.</Paragraph> <Paragraph position="3"> We have built models using decision trees (C4.5), bagging and rule-induction (Ripper) machine learning techniques to predict the Information Structure. We have also implemented a deterministic, rule-based system that assigns t or f according to the annotation guidelines presented in Table 1. The rule-based system has no access to the position of the intonation center (IC).</Paragraph> <Paragraph position="4"> The baseline simulates the preprocessing procedure used before the manual annotation of the TFA attribute in the PDT, i.e., it always assigns the class that has the most instances.</Paragraph> <Paragraph position="5"> Our machine learning models are compared against the baseline and the rule-based system. As a metric we have used the Weighted Averaged F-score, which is computed as the class-frequency-weighted mean of the per-class F-scores: $F_{w} = \frac{n_t}{N} F_t + \frac{n_f}{N} F_f$, where $n_c$ is the number of test instances of class $c$, $N = n_t + n_f$, and $F_c$ is the F-score for class $c$.</Paragraph> <Paragraph position="7"> The reason why we have chosen this metric (instead of Correctly Classified, for example) is that it gives a more realistic evaluation of the system, considering also the distribution of t and f items.</Paragraph> <Paragraph position="8"> Consider, for example, the case in which the test set consists of 70% f items and 30% t items. The baseline system would have as much as 70% correctly classified instances, simply because of the t/f distribution. The Weighted Averaged F-score would in this case be 57.64%, which is a more adequate value that better reflects the poorness of such a system.</Paragraph> </Section> <Section position="3" start_page="117" end_page="119" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> The results of the experiment using all instances (nouns and pronouns) are shown in Table 2 in the second column. 
C4.5 and Bagging achieve the best performance, improving on the results of the rule-based system by 6.99%.</Paragraph> <Paragraph position="1"> The top of the decision tree generated by C4.5 in the training phase looks like this:</Paragraph> <Paragraph position="3"> The overall tree has 129 leaves out of 161 nodes.</Paragraph> <Paragraph position="4"> In order to achieve a better understanding of the difficulty of the task for nouns and pronouns, we considered evaluations on the following classes of instances: We also wanted to investigate whether the three classifiers perform differently with respect to different classes of instances (in which case we could have a general system that uses several classifiers, and for certain classes of instances we would 'trust' a certain classifier, according to its performance on the development data).</Paragraph> <Paragraph position="5"> results on different classes of instances. The test set for each class of instances consists of 1/3 of the instances in the data belonging to that class, randomly extracted in the same fashion as for the overall split.</Paragraph> <Paragraph position="6"> The baseline is higher for some classes, yet the classifiers always perform better than the baseline and also better than the rule-based system, which for non-verb children performs worse than the baseline. However, the difference between the three classifiers is very small, and only in one case (for the coreferential pronouns) is C4.5 outperformed by Ripper.</Paragraph> <Paragraph position="7"> To improve the results even more, there are two possibilities: either providing more training data, or considering more features. 
To investigate the effect of the size of the training data, we have computed the learning curves for the three classifiers. Figure 1 shows the C4.5 learning curve for the overall experiment on nouns and pronouns; the learning curves for the other two classifiers are similar and are not included in the figure.</Paragraph> <Paragraph position="8"> The curve is interesting, showing that after only 1% of the training set (1961 instances) C4.5 can already perform well, and adding more training data improves the F-score only slightly. To ensure that the initial 1% is not over-representative of the IS phenomena, we experimented with different 1% parts of the training set, and the results were similar. We also did a 10-fold cross-validation experiment on the training set, which resulted in a Weighted Averaged F-score of 82.12% for C4.5.</Paragraph> <Paragraph position="9"> The slight improvement achieved by providing more data indicates that improvements are likely to come from using more features.</Paragraph> <Paragraph position="10"> Table 3 shows the contribution of the two types of features (basic and derived) for the experiment with all instances (nouns and pronouns). For comparison, we again display the F-scores of the baseline and the rule-based system.</Paragraph> <Paragraph position="11"> given as a percentage.</Paragraph> <Paragraph position="12"> The results show that the model trained only with basic features performs much better than the baseline, yet it is not as good as the rule-based system. However, removing the basic features completely and keeping only the derived features considerably lowers the score (by more than 4%). This indicates that adding more basic features (which are easy to obtain from the treebank) could actually improve the results.</Paragraph> <Paragraph position="13"> The derived features, however, have the biggest impact on the performance of the classifiers. 
Yet, adding more sophisticated features that would help in this task (e.g., coreferentiality for nouns) is difficult because they cannot be computed reliably.</Paragraph> </Section> </Section> </Paper>