File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-2715_metho.xml
Size: 10,116 bytes
Last Modified: 2025-10-06 14:10:54
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2715"> <Title>Multidimensional markup and heterogeneous linguistic resources</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction - Why (and how) to use </SectionTitle> <Paragraph position="0"> heterogeneous linguistic resources A large and diverse amount of linguistic resources (audio and video recodings, textual recordings) has been piled up during various projects all over the world. A reasonable subset of these resources consists of machine-readable structured linguistic documents (often XML annotated), dictionaries, grammars or ontologies. Sometimes these are available to the public on the Web, cf. Simons (2004). The availability allows for the sophisticatedexaminationoflinguisticquestionsand null the reuse of existing linguistic material. Especially corpora annotated for discourse-related phenomena have become an important source for various linguistic studies. Besides annotated corpora external knowledge bases like lexical nets (e.g., WordNet, GermaNet) and grammars, can be used to support several linguistic processes.</Paragraph> <Paragraph position="1"> Although XML has recently established itself as the technology and format of choice the before mentioned resources remain heterogeneous 1The work presented in this paper is part of the project A2 Sekimo of the Research Group Text-technological modelling of information funded by the German Research Foundation. in respect of the data format (i.e., the underlying schema) or the functionality provided. A simple approach to make use of different resources is to usethemonebyone,startingwithrawtextdata(or annotated XML) as input and providing the output of the first process (e.g., a tagger) as input for the next step (e.g., a parser). However, this method may lead to several problems. One possible problem of this method is that the output format of one processing resource can be unemployable as input format for the next. Another potential problem of using XML annotated documents is overlapping annotation. And finally it is sometimes necessary (or desirable) to process only parts of the input document.</Paragraph> <Paragraph position="2"> The structure of the paper is as follows: In Section 2 our approach of representing multiple annotations is desribed, in Section 3 the use of multiroot trees for the representation of heterogeneous ressources is presented. As a case study, the resolution of anaphoric relations is described in Section 4.</Paragraph> </Section> <Section position="3" start_page="0" end_page="85" type="metho"> <SectionTitle> 2 Multiple annotations </SectionTitle> <Paragraph position="0"> Representing data corresponding to different levels of annotation is a fundamental problem of texttechnological research. Renear et al. (1996) discuss the OHCO-Thesis2 as one of the basic assumptionsaboutthestructureoftextandshowthat null thisassumptioncannotbeupheldconsistently. Being based on the OHCO-Thesis most markup languages (including SGML and XML) are designed in principle to represent one structure per document. Options to resolve this problem are discussed in Barnard et al. (1995) and several other (1990) states &quot;that text is best represented as an ordered hierarchy of content object (OHCO).&quot; proposals. 
<Section position="1" start_page="85" end_page="85" type="sub_section">
<SectionTitle> 2.1 Representation </SectionTitle>
<Paragraph position="0"> We address the issue of overlapping markup by using separate annotations of the relevant phenomena (e.g., syntactic information, POS, document structure) according to different document grammars, i.e., the same textual data is encoded several times (in separate files). One advantage of this multiple annotation is that the modeling of information on a level A is not dependent in any way on a level B (in contrast to the standoff annotation model described by Thompson and McKelvie (1997), where a primary modeling level is needed). Additional annotation layers can be added easily without any changes to the layers already established.</Paragraph>
<Paragraph position="1"> The primary data (i.e., the text which will be annotated) is separated from the markup and serves as the key element for establishing relations between the annotation tiers. Witt et al. (2005) describe a knowledge representation format (in the programming language Prolog) which can be used to draw inferences between separate levels of annotation and in which the parts of text (the PCDATA in XML terms) are used as an absolute reference system and as the link between the levels of annotation (cf. Bayerl et al. (2003)). An additional representation format integrates the representation of multi-rooted trees developed in the NITE project (cf. Carletta et al. (2003)).</Paragraph>
<Paragraph position="2"> This representation format allows us to use various linguistic resources. A Python script converts the different annotation layers (XML documents) to the above-mentioned Prolog representation, which serves as input for the unification process. The elements, attributes and text of all annotation layers are stored as Prolog predicates. As a requirement, all files must be identical with respect to their underlying character data, which we call identity of primary data.</Paragraph>
</Section>
<Section position="2" start_page="85" end_page="85" type="sub_section">
<SectionTitle> 2.2 Unification </SectionTitle>
<Paragraph position="0"> Figure 1 shows the architecture used for the unification process. Different annotation layers are unified, i.e., merged into one output fact base.</Paragraph>
<Paragraph position="1"> A script reconverts the Prolog representation into one well-formed XML document containing the annotations from all layers plus the textual content. In case of overlapping elements (which may be the result of the unification), it converts those elements to milestones or fragments (cf. TEI Guidelines (2004)) according to parameter options. A Java-based GUI is available to ease the use of the above-mentioned framework.</Paragraph>
</Section>
</Section>
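As a rough illustration of the conversion step described in Section 2.1, the following Python sketch turns one annotation layer into Prolog-style facts anchored to character offsets in the primary data. The predicate names and argument order are assumptions made for this sketch; the actual representation format of Witt et al. (2005) and the project's converter are not reproduced here.

    # Sketch only: the predicate shape element(Layer, Tag, Start, End) is an assumption.
    import xml.etree.ElementTree as ET

    def to_facts(layer_name, xml_string):
        """Emit Prolog-style facts; every element is anchored to character
        offsets in the primary data shared by all annotation layers."""
        root = ET.fromstring(xml_string)
        facts = []

        def walk(elem, offset):
            start = offset
            offset += len(elem.text or '')
            for child in elem:
                offset = walk(child, offset)
                offset += len(child.tail or '')
            facts.append(f"element({layer_name}, {elem.tag}, {start}, {offset}).")
            for key, value in elem.attrib.items():
                facts.append(f"attribute({layer_name}, {elem.tag}, {key}, '{value}').")
            return offset

        walk(root, 0)
        return facts

    # Running the converter over a toy layer:
    print('\n'.join(to_facts('doc', '<text><p>Marie kam herein.</p></text>')))
    # element(doc, p, 0, 17).
    # element(doc, text, 0, 17).

Because all layers must be identical with respect to their character data, facts generated from a second layer refer to the same offsets, which is what makes the unification of Section 2.2 a matter of merging fact bases.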
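The milestone fallback of Section 2.2 can be sketched in the same spirit. The fragment below assumes TEI-style empty milestone elements named tagStart and tagEnd; the project's serializer, its parameter options and its fragmentation alternative are not reproduced here.

    # Sketch only: serialize one layer as real elements and a second,
    # potentially overlapping layer as empty milestone elements.
    def serialize_with_milestones(text, nesting_spans, milestone_spans, root='text'):
        events = []  # (position, order key: end markup before start markup, markup)
        for tag, start, end in nesting_spans:
            events.append((start, 1, f"<{tag}>"))
            events.append((end, 0, f"</{tag}>"))
        for tag, start, end in milestone_spans:
            events.append((start, 1, f"<{tag}Start/>"))
            events.append((end, 0, f"<{tag}End/>"))
        out, last = [], 0
        for pos, _, markup in sorted(events, key=lambda e: (e[0], e[1])):
            out.append(text[last:pos])
            out.append(markup)
            last = pos
        out.append(text[last:])
        return f"<{root}>" + ''.join(out) + f"</{root}>"

    print(serialize_with_milestones(
        "Marie kam herein. Sie lachte.",
        [("s", 0, 18), ("s", 18, 29)],          # sentence layer: nests cleanly
        [("line", 0, 22), ("line", 22, 29)]))   # layout layer: would overlap
    # <text><s><lineStart/>Marie kam herein. </s><s>Sie <lineEnd/><lineStart/>lachte.</s><lineEnd/></text>

Emitting end-of-span markup before start-of-span markup at the same position keeps adjacent elements from being nested into one another.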
<Section position="4" start_page="85" end_page="85" type="metho">
<SectionTitle> 3 Multi-rooted trees </SectionTitle>
<Paragraph position="0"> Based on the architecture described above, we now focus on the usage of heterogeneous linguistic resources (as described in Section 1) in order to semi-automatically add layers of markup to various XML (or plain text) documents. We prefer the term multi-rooted trees over multiple annotations, i.e., different layers of annotation are stored in a single representation (based on the above-mentioned architecture). The input document is separated into the textual information and the annotation tree (if there is any). Both of these are provided as input for linguistic resources. The output of this process (typically an XML annotated document) is again separated into text and markup and serves as input for another resource.</Paragraph>
<Paragraph position="1"> Figure 2 gives an overview of the process.</Paragraph>
</Section>
<Section position="5" start_page="85" end_page="87" type="metho">
<SectionTitle> 4 Heterogeneous linguistic resources for anaphora resolution: a case study </SectionTitle>
<Paragraph position="0"> We use multi-rooted trees in order to annotate and model coreferential phenomena and anaphoric relations (cf. Sasaki et al. (2002)). The basis for the resolution of anaphoric relations (both pronominal anaphora and definite description anaphora) is a small test corpus containing German newspaper articles and scientific articles in German and English. Figure 3 shows an excerpt of a German newspaper article taken from "Die Zeit". In this example the first linguistic resource to apply is a parser (in our case Machinese Syntax by Connexor Oy). As a second step, an XSLT script uses the input document and the parser output and tags discourse entities (see element de in Figure 3) by judging several syntactic criteria provided by the parser. The discourse entities mark the starting point for determining anaphora-antecedent relations between pairs of discourse entities.</Paragraph>
<Paragraph position="1"> In order to resolve bridging relations (e.g., "door" as a meronym of "room"), WordNet (GermaNet for German texts) is used as a linguistic resource to establish relationships between discourse entities according to the information stored in the synsets. Currently, we use an open-source native XML database as a test tool for querying the GermaNet data. Resolving synonymous or hyperonymous anaphoric relations is done by using XPath or XQuery queries on pairs of discourse entities. Bridging relations are harder to track down and will be focused on in the near future.</Paragraph>
<Paragraph position="2"> Figure 3 shows the shortened and manually revised output of the anaphora resolution. In this example two annotation layers have been merged: the logical document structure (in our case a modified version of DocBook, doc) and the level of semantic relations (chs). The logical document structure describes the organisation of the text document in terms of chapters, sections, paragraphs, and the like. The level of semantic relations describes discourse entities and relations between them. Corpus investigations give rise to the supposition that the logical text structure influences the search scope of candidates for antecedents. Anaphoric relations are annotated with a cospecLink element (lines 12 to 15). The attribute relType holds the type of relation between two discourse entities. Line 15 is an example of an identity relation between discourse entity de n 078 ("Marie Rolfs", line 6) and discourse entity de n 083 ("sie", line 9), whereby the first is marked as the antecedent.</Paragraph>
</Section>
</Paper>
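As a minimal sketch of the lexical-relation test mentioned in Section 4, the following fragment queries a tiny, hypothetical synset file with XPath to decide whether a candidate antecedent is a hyperonym of an anaphoric noun. The synset markup, the attribute names and the flattened hyperonymy encoding are invented for illustration; the actual GermaNet data, the native XML database and the XQuery expressions used in the project are richer than this.

    # Sketch only: toy synset data, not the GermaNet schema.
    import xml.etree.ElementTree as ET

    SYNSETS = ET.fromstring(
        "<synsets>"
        "<synset id='s1'><orthForm>Gebäude</orthForm></synset>"
        "<synset id='s2' hyperonym='s1'><orthForm>Haus</orthForm></synset>"
        "</synsets>")

    def synset_id(lemma):
        """Id of the first synset containing the lemma, or None."""
        hit = SYNSETS.find(f".//synset[orthForm='{lemma}']")
        return hit.get('id') if hit is not None else None

    def is_hyperonym(candidate, anaphor):
        """True if the candidate antecedent's synset is the (direct)
        hyperonym of the anaphor's synset in the toy encoding."""
        lower = SYNSETS.find(f".//synset[orthForm='{anaphor}']")
        return lower is not None and lower.get('hyperonym') == synset_id(candidate)

    # A pair of discourse entities, e.g. antecedent "Gebäude" and anaphor "Haus":
    print(is_hyperonym("Gebäude", "Haus"))   # True for the toy data

In the project such checks are run over pairs of discourse entities produced by the de-tagging step, with GermaNet used for the German texts and WordNet for the English ones.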