<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0207"> <Title>Text Type Structure and Logical Document Structure</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 3 Data and annotation levels </SectionTitle> <Paragraph position="0"> We carried out the experiments on a corpus of 47 linguistic research articles taken from the German online journal 'Linguistik Online',1 volumes 2000-2003. The selected articles come with HTML markup, have an average length of 8639 word forms, and deal with subjects as diverse as the syntax of adverbs, chat analysis, and language learning.</Paragraph> <Paragraph position="1"> Taking a text-technological approach, we prepared this corpus such that all required types of information, including the target classification categories and the classification features to be extracted, are realized as XML annotations of the raw text.</Paragraph> <Paragraph position="2"> Thus, XML markup was provided for the thematic level, a logical structure level, and a grammatical level. As described in Bayerl et al. (2003), annotation levels are distinguished from annotation layers. An annotation level is an abstract level of information (such as the morphology and syntax levels in linguistics), originally independent of any annotation scheme. The term annotation layer, in contrast, refers to the realization of an annotation level, e.g. as XML markup. There need not be a 1:1 correspondence between annotation levels and layers. As for the three annotation levels in our setting, one (the structural level) was realized as an independent layer, and two (the thematic and grammatical levels) were realized in one single annotation layer. 
Each annotation layer of an article is stored in a separate file, and it is ensured that the PCDATA of all layers are identical.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Annotation of text type structure </SectionTitle> <Paragraph position="0"> The order of topic types in a specific scientific article may deviate from the canonical order represented in the XML schema grammar of the text type structure shown in Figure 2. Therefore, a flat version of the hierarchical XML schema was derived by means of an XSLT style sheet, exploiting the fact that XML schema grammars, unlike DTDs, are XML documents themselves. In the derived flat XML schema, topic types are represented as attribute values of elements called <group> and <segment>, instead of as names of nested elements. Empty <group> elements represent topic types that correspond to inner nodes in the original tree of topic types, while <segment> elements correspond to its leaves (terminal categories). The original hierarchical structure is still represented via the ID/IDREF attributes id and parent, similar to O'Donnell's (2000) representation of rhetorical structure trees.</Paragraph> <Paragraph position="1"> For the annotation, the raw text of each article was automatically partitioned into text segments corresponding to sentences, but the annotators were allowed to modify (join or split) segments to yield proper thematic units. The problem of automatically finding thematic boundaries other than sentence boundaries (e.g. Utiyama and Isahara (2001)) is thus not addressed in this work. The annotators then provided the values of the attribute topic using the XML spy editor, choosing exactly one of the 16 terminal topic types for each segment, or alternatively the category void meta for metadata such as acknowledgements. If more than one topic type could in principle be assigned, the annotators were instructed to choose the one that was most central to the argumentation. 
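The flat representation with id/parent IDREF attributes described above can be sketched in code. The element and attribute names (<group>, <segment>, id, parent, topic) follow the text; the miniature document and its topic-type values are invented for illustration, and the reconstruction is done in Python rather than with the XSLT machinery the paper uses:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented miniature THM layer: empty <group> elements carry the inner
# nodes of the topic-type tree, <segment> elements carry the leaves;
# id/parent are the ID/IDREF attributes that preserve the hierarchy.
flat = ET.fromstring("""
<thm>
  <group id="g1" topic="empirical"/>
  <group id="g4" parent="g1" topic="evidence"/>
  <segment id="s75" parent="g4" topic="dataAnalysis">...</segment>
  <segment id="s76" parent="g4" topic="results">...</segment>
</thm>
""")

nodes = {}
children = defaultdict(list)
for el in flat:
    nodes[el.get("id")] = el
    children[el.get("parent")].append(el.get("id"))

def tree_lines(node_id, depth=0):
    """Render the recovered hierarchy as indented lines."""
    el = nodes[node_id]
    lines = ["  " * depth + f"{el.tag}[{el.get('topic')}]"]
    for child_id in children[node_id]:
        lines.extend(tree_lines(child_id, depth + 1))
    return lines

for root_id in children[None]:  # roots have no parent attribute
    print("\n".join(tree_lines(root_id)))
```

Despite the flat element sequence, the original tree is fully recoverable, which is the point of the ID/IDREF encoding.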
An extract from a THM annotation layer is shown in Figure 3.2 The two annotators were experienced: they had received intensive training and had previously annotated a corpus of psychological articles according to an extended version of the schema in Figure 1 (Bayerl et al., 2003). We assessed inter-rater reliability on three articles from the present linguistics corpus, which were annotated independently by both annotators according to the topic type set shown in Figure 1. (2The extract, which is also shown in Figure 4, is taken from Bühlmann (2002).)</Paragraph> <Paragraph position="2"> <segment id=&quot;s75a&quot; parent=&quot;g19&quot; topic=&quot;dataAnalysis&quot;> Die obige Reihenfolge verändert sich etwas, wenn nicht die gesamte Anzahl der Personenbezeichnungen ausschlaggebend ist, sondern die Anzahl unterschiedlicher Personenbezeichnungen (das heisst, eine Personenbezeichnung wie z.B. Jugendliche, die acht Mal verwendet wurde, wird trotzdem nur einmal gezählt): </segment> <segment id=&quot;s76&quot; parent=&quot;g4&quot; topic=&quot;results&quot;> Im ganzen kommen in den untersuchten Artikel 261 verschiedene Personenbezeichnungen vor. Davon sind über 46,7% generische Maskulina, und nur 31% sind Institutions- und Kollektivbezeichnungen. Es folgen die geschlechtsneutralen und -abstrakten Bezeichnungen mit 18,4%, und nach wie vor stehen die Doppelformen mit 3,8% Bezeichnungen </segment> </Paragraph> <Paragraph position="3"> (Prior to the analysis, the articles were resegmented manually so that segment boundaries were completely identical.) An average agreement of Kappa = 0.73 was reached (min: .63, max: .78), which can be interpreted as 'substantial' agreement (Landis and Koch, 1977). In order to test for annotation biases we also performed a Stuart-Maxwell test (Stuart, 1955; Maxwell, 1970), which led to the conclusion that marginal homogeneity must be rejected at the 1% level (χ2 = 61.24; df = 14). 
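The agreement figure reported above is a chance-corrected kappa. A minimal two-rater Cohen's kappa can be sketched as follows; the label sequences below are invented toy data, while the paper's own values were computed over three doubly annotated articles:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Two-rater Cohen's kappa over parallel label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # observed agreement: proportion of identical labels
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement under chance, from the raters' marginals
    ca, cb = Counter(a), Counter(b)
    p_expected = sum(ca[c] * cb[c] for c in ca) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# invented toy annotations over five segments
rater1 = ["results", "results", "interpretation", "conclusions", "background"]
rater2 = ["results", "results", "conclusions", "conclusions", "background"]
print(round(cohen_kappa(rater1, rater2), 2))  # → 0.72
```

Values above roughly 0.6 are conventionally read as 'substantial' agreement per Landis and Koch (1977), which is how the paper interprets its 0.73.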
McNemar tests (McNemar, 1947) revealed that the topic types textual, results, interpretation, othersWork, and conclusions were the problematic categories. Subsequent log-linear analyses revealed that annotator 1 had systematically assigned background where annotator 2 had assigned framework. Furthermore, interpretation was regularly confused with conclusions, and concepts with either background or othersWork (model fit: χ2 = 173.14, df = 155, p = .15).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Annotation of syntax and morphology </SectionTitle> <Paragraph position="0"> To annotate the word form tokens in our corpus with grammatical categories, we employed the commercial tagger Machinese Syntax by Connexor Oy.</Paragraph> <Paragraph position="1"> This tagger is a rule-based, robust syntactic parser available for several languages and based on Constraint Grammar and Functional Dependency Grammar (Tapanainen and Järvinen, 1997). It provides morphological, surface syntactic, and functional tags for each word form and a dependency structure for sentences, and it is moreover able to process and output &quot;simple&quot; XML (that is, XML without attributes). No conflicts in terms of element overlaps can arise between our THM annotation layer and the grammatical tagging, because all tags provided by Machinese Syntax pertain to word forms.</Paragraph> <Paragraph position="2"> The grammatical annotations could therefore be integrated with the THM annotations, forming the XML annotation layer that we call THMCNX. An XSLT stylesheet is applied to convert the THM annotations into attribute-free XML by integrating the information from attribute-value specifications into the names of their respective elements. After the grammatical tagging, a second stylesheet re-converts the resulting attribute-free XML representations into the original complex XML enriched by the grammatical tags. 
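The attribute-free round trip described above can be sketched in a few lines. The paper only says that attribute-value specifications are folded into element names via XSLT; the naming convention and the "-" separator below are invented for illustration (and would break if a value itself contained the separator), and Python stands in for the two stylesheets:

```python
import xml.etree.ElementTree as ET

SEP = "-"  # invented separator; real values must not contain it

def to_attribute_free(el):
    """Fold each attribute-value pair into the element name."""
    name = SEP.join([el.tag] + [f"{k}{SEP}{v}" for k, v in sorted(el.attrib.items())])
    new = ET.Element(name)
    new.text = el.text
    for child in el:
        new.append(to_attribute_free(child))
    return new

def from_attribute_free(el):
    """Split the element name back into a tag plus attributes."""
    parts = el.tag.split(SEP)
    new = ET.Element(parts[0])
    for k, v in zip(parts[1::2], parts[2::2]):
        new.set(k, v)
    new.text = el.text
    for child in el:
        new.append(from_attribute_free(child))
    return new

seg = ET.fromstring('<segment id="s76" topic="results">Im ganzen ...</segment>')
plain = to_attribute_free(seg)
print(plain.tag)      # segment-id-s76-topic-results
restored = from_attribute_free(plain)
print(restored.attrib)
```

The intermediate representation carries no attributes at all, which is exactly what a tagger restricted to "simple" XML requires; the second pass restores the original markup losslessly.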
In addition, we reformatted the original Machinese Syntax tags by omitting, merging, and renaming some of them, again using XSLT. The <cmp-head-lemma> tag (containing the lemma of the head of the present word form), for example, was derived from the original <lemma> tag, the value of which contains compound segmentation information. On the THMCNX layer, a subset of 15 grammatical tags may appear at each word form, including <pos> (part of speech), <aux> (auxiliary verb), and <num> (number feature for nominal categories).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Logical document structure annotation </SectionTitle> <Paragraph position="0"> Since HTML is a hybrid markup language including a mixture of structural and layout information, we chose to convert the original HTML of the corpus into XML based on the DocBook standard (Walsh and Muellner, 1999). DocBook was originally designed for technical documentation and represents a purely logical document structure, relying on style sheets to interpret the logical elements and produce a desired layout. We did not employ the whole, very large official DocBook DTD, but designed a new XML schema that defines a subset of 45 DocBook elements plus 13 additional logical elements such as tablefootnote and numexample, which appear in the annotations after the namespace prefix log.3 The annotations were obtained using a Perl script that produced raw DocBook annotations from the HTML markup, and the XML spy editor for validation and for manually filling in elements that have no correspondences in HTML. Figure 4 shows the DocBook annotation of the extract that was also given in Figure 3.</Paragraph> <Paragraph position="1"> Moreover, structural position attributes were added to each element by means of an XSLT style sheet. 
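Computing such structural position attributes can be sketched as follows. The paper performs this step with an XSLT style sheet; the Python version below is only an illustration of the idea, producing XPath-style paths of the form /article[1]/sect1[2] (the 'POSINFO' attribute name follows the text, the sample document is invented):

```python
import xml.etree.ElementTree as ET

def add_posinfo(el, path=""):
    """Attach an XPath-like position attribute to every descendant of el."""
    counts = {}  # per-tag sibling counters for positional predicates
    for child in el:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
        child.set("POSINFO", child_path)
        add_posinfo(child, child_path)

doc = ET.fromstring("<article><title/><sect1/><sect1/></article>")
root = ET.Element("root")  # wrapper so the article element itself gets a path
root.append(doc)
add_posinfo(root)
print(doc.get("POSINFO"))      # /article[1]
print(doc[2].get("POSINFO"))   # /article[1]/sect1[2]
```

Each element thus records its full DOM-tree address, so a thematic segment can later be related to a position such as "the second sect1 of the article" without re-parsing the document.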
These 'POSINFO' attributes make explicit the position of the element in the XML DOM tree of the document instance as an XPath expression, as shown in Figure 5. (3This XML schema was designed in collaboration with the HyTex project at the University of Dortmund.)</Paragraph> <Paragraph position="2"> As pointed out above, XML document structure has previously been exploited in the automatic classification of complete documents, e.g. in (Yi and Sundaresan, 2000; Denoyer and Gallinari, 2003). However, we want to use XML document structure in the classification of thematic segments of documents, where the thematic segments are XML elements in the THM annotation layer. The THM and DOC layers cannot necessarily be combined into a single layer, since we had refrained from imposing the constraint that they always be compatible, i.e. free of overlaps. Still, we had to relate element instances on the DOC layer to element instances on the THM layer.</Paragraph> <Paragraph position="3"> For this purpose, we resorted to the Prolog query tool seit.pl, developed at the University of Bielefeld in the project Sekimo,4 for the inference of relations between two annotation layers of the same text. seit.pl infers 13 mutually exclusive relations between instances of element types on separate annotation layers on the basis of their shared PCDATA. In view of the application we envisaged, we defined four general relations, one of which was Identity and three of which were defined as unions of several more specific seit.pl relations: Identity: The original identity relation from seit.pl.</Paragraph> <Paragraph position="4"> Included: Holds if a thematic segment is properly included in a DocBook element in terms of the ranges of the respective PCDATA, i.e. is defined as the union of the original seit.pl relations included A in B, starting point B, and end point B. 
This relation was considered to be significant because we would, for example, expect THM segments annotated with the topic type interpretation to appear within /article[1]/sect1[5] rather than /article[1]/sect1[1] elements (i.e. the fifth rather than the first sect1 element).</Paragraph> <Paragraph position="5"> Includes: Holds if a thematic segment properly includes a DocBook element in terms of the ranges of the respective PCDATA, i.e. is defined as the union of the original seit.pl relations included B in A, starting point A, and end point A. This relation was considered to be significant because we would, for example, expect logical elements such as numexample to be included preferentially in segments labelled with the topic type data.</Paragraph> <Paragraph position="6"> Overlap: Holds if a thematic segment properly overlaps with a DocBook element in terms of the ranges of the respective PCDATA. This relation was considered less significant because the overlapping portion of PCDATA might be very small, and seit.pl so far does not allow for querying how large the overlapping portion actually is.</Paragraph> <Paragraph position="7"> The Prolog code of seit.pl was modified such that it outputs XML files containing the THM annotation layer, with the structural positions from the DOC layer included within each segment as values of elements that indicate the relation found (cf. Figure ). To evaluate the feasibility of automatically annotating scientific articles according to our THM annotation layer, we applied different classification models to the text segments, namely a KNN classifier (cf. section 4.1) and, for purposes of comparison, a simplified Rocchio classifier. 
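The four general relations above can be sketched directly in terms of character offsets over the shared PCDATA. seit.pl works on Prolog representations of the two annotation layers; the invented Python version below just compares (start, end) ranges of a THM segment and a DocBook element:

```python
def relation(thm, doc):
    """Classify the relation between two (start, end) PCDATA ranges."""
    ts, te = thm
    ds, de = doc
    if (ts, te) == (ds, de):
        return "Identity"
    if ds <= ts and te <= de:   # segment properly inside the element
        return "Included"
    if ts <= ds and de <= te:   # segment properly contains the element
        return "Includes"
    if ts < de and ds < te:     # ranges intersect without nesting
        return "Overlap"
    return None                 # disjoint: none of the four relations

print(relation((10, 50), (0, 100)))   # Included
print(relation((10, 50), (20, 30)))   # Includes
print(relation((10, 50), (40, 90)))   # Overlap
```

Because identity is tested first and proper inclusion before overlap, the four outcomes are mutually exclusive, mirroring the mutual exclusivity of the underlying seit.pl relations.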
One important motivation for these experiments was to find out which kind of data representation yields the best classification accuracy and, in particular, whether the combination of complementary information sources, such as bag-of-words representations of text on the one hand and the structural information provided by the DocBook path annotations on the other, produces additional synergy effects.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 KNN classification </SectionTitle> <Paragraph position="0"> The basic idea of the K nearest neighbor (KNN) classification algorithm is to use already categorized examples from a training set in order to assign a category to a new object. The first step is to choose the K nearest neighbors (i.e. the K most similar objects according to some similarity metric, such as the cosine) from the training set. In a second step, the categorial information of the nearest neighbors is combined, in the simplest case by determining the majority class.</Paragraph> <Paragraph position="1"> The version of KNN classification adopted here uses the Jensen-Shannon divergence (also known as information radius or iRad) as a dissimilarity measure:</Paragraph> <Paragraph position="2"> iRad(p,q) = D(p || (p+q)/2) + D(q || (p+q)/2), where D denotes the Kullback-Leibler divergence.</Paragraph> <Paragraph position="3"> iRad ranges from 0 (identity) to 2log2 (no similarity) and requires that the compared objects are probability distributions.</Paragraph> <Paragraph position="4"> Let NO,C = {n1,...,nm} (0 ≤ m ≤ K) be the set of those objects among the K nearest neighbors of some new object O that belong to a particular category C. Then the score assigned to the classification O ∈ C is</Paragraph> <Paragraph position="6"> score(O ∈ C) = Σj=1..m (2log2 - iRad(O,nj))^E.</Paragraph> <Paragraph position="7"> Depending on the choice of E, one obtains either a simple majority decision (if E = 0), a linear weighting of the iRad similarity (if E = 1), or a stronger emphasis on closer training examples (if E > 1). 
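The scoring scheme just described can be sketched as follows: iRad is the Jensen-Shannon divergence between probability distributions, and each of the K nearest neighbours n_j belonging to category C contributes (2log2 - iRad(O, n_j))^E to the score of O in C. The tiny distributions and category labels below are invented toy data:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (natural log) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    """Jensen-Shannon divergence: 0 for identical p, q, at most 2*log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def knn_score(obj, neighbours, E):
    """Sum (2log2 - iRad)^E per category over the K nearest neighbours."""
    scores = {}
    for dist, cat in neighbours:
        scores[cat] = scores.get(cat, 0.0) + (2 * math.log(2) - irad(obj, dist)) ** E
    return scores

obj = [0.5, 0.3, 0.2]
neighbours = [([0.5, 0.3, 0.2], "results"),      # identical: iRad = 0
              ([0.1, 0.1, 0.8], "dataAnalysis")]  # distant neighbour
print(knn_score(obj, neighbours, E=1))
```

With E = 0 every neighbour contributes exactly 1, reducing the score to a majority vote; raising E increasingly favours the closest training examples, matching the behaviour described in the text.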
In practice, it turned out that very high values of E improved the classification accuracy. Finally, the KNN scores for each segment were normalized to probability distributions in order to obtain comparable results for different K and E when the KNN classifications are combined with the bigram model.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Bigram model </SectionTitle> <Paragraph position="0"> The bigram model gives the conditional probability of a topic type Tn+1 given its predecessor Tn.</Paragraph> <Paragraph position="1"> For a sequence of segments s1...sm, the total score t(T,si) for the assignment of a topic type T to si is the product of the bigram probability given the putative predecessor topic type (i.e. the topic type T' with the highest t(T',si-1) computed in the previous step) and the normalized score of the KNN classifier. The total score of a topic type sequence is the product of its t scores.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Information sources </SectionTitle> <Paragraph position="0"> In our classification experiments we used six different representations, which can be viewed as different feature extraction strategies or different levels of abstraction: * word forms (wf): a bag-of-words representation of the segment without morphological analysis; special characters (punctuation, braces, etc.) are treated as words.</Paragraph> <Paragraph position="1"> * compound heads (ch): stems; in the case of compounds, the head is used instead of the whole compound. These features were extracted from the THMCNX layer (cf. section 3.2).</Paragraph> <Paragraph position="2"> * size (sz): number of words per segment (calculation based on the THM annotation layer, cf. 
section 3.1).</Paragraph> <Paragraph position="3"> * DocBook paths (dbp): the segment is represented as the set of DocBook paths that include it (i.e. the segment stands in the Included relation to them, as explained in section 3.3).</Paragraph> <Paragraph position="4"> * selected DocBook features (df): a set of 6 DocBook features indicating occurrences of block quotes, itemized lists, numbered examples, ordered lists, tables, and references to footnotes standing in any of the four relations listed in section 3.3.</Paragraph> <Paragraph position="5"> * POS tags (pos): the distribution of part-of-speech tags within the segment, taken from the THMCNX layer (cf. section 3.2).</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Training and Evaluation </SectionTitle> <Paragraph position="0"> For each test document, the bigram model and the classifier were trained on all other documents.</Paragraph> <Paragraph position="1"> The overall size of the data collection was 47 documents; thus, each classifier and each bigram model was trained on 46 documents. The total number of segments was 7330.</Paragraph> </Section> </Section> </Paper>