<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2706"> <Title>Querying XML documents with multi-dimensional markup</Title> <Section position="2" start_page="0" end_page="43" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> XML is widely used by NLP tools for annotating texts. Different NLP tools can produce overlapping annotations of text fragments.</Paragraph> <Paragraph position="1"> While a common way to cope with concurrent markup is to use stand-off markup (Witt, 2004) with XPointer references to the annotated regions in the source document, another solution is to consolidate the annotations in a single document for easier processing. In that case, concurrent markup has to be merged and accommodated in a single hierarchy. Overlapping markup can be merged in many ways, so different nesting structures are possible. In addition, the annotations have to be merged with the original markup of the document (e.g. in the case of an HTML document).</Paragraph> <Paragraph position="2"> The problem of merging overlapping markup has been treated in (Siefkes, 2004) and we do not consider it here. Instead, we focus on the problem of finding a universal querying mechanism for documents with multi-dimensional markup. The query language should abstract from the concrete merging algorithm for concurrent markup; that is, it should identify the desired elements and sequences of elements independently of the concrete nesting structure. The development of the query language was motivated by an application in text mining.</Paragraph> <Paragraph position="3"> In some text mining systems, linguistic patterns that comprise text and XML annotations produced by linguistic tools (such as syntactic annotations and POS tags) are matched against semistructured texts to find the desired information. These texts can be HTML documents that are enriched with linguistic information by NLP tools and therefore contain multi-dimensional markup. 
The linguistic annotations are specified by XML elements that contain the annotated text fragment as CDATA.</Paragraph> <Paragraph position="4"> Because the HTML document can have an arbitrary structure, the annotations can be nested at arbitrary depth and, conversely, a linguistic XML element can contain HTML elements with the nested text it refers to. To find a linguistic pattern we have to abstract from the concrete DTD and the actual structure of the XML document, ignoring irrelevant markup, which leads to a kind of &quot;fuzzy&quot; matching. Hence it is sufficient to specify a sequence of text fragments and known XML elements (e.g. linguistic tags) without knowing which elements enclose them. During matching, the enclosing markup is skipped even if the elements of the sequence lie at different nesting levels.</Paragraph> <Paragraph position="5"> We propose an expressive pattern language with extended semantics of the sequence pattern, permutation, negation and regular patterns that is especially appropriate for querying XML-annotated documents. The language provides a rich tool set for specifying complex sequences of XML elements and textual fragments. We ignore some important aspects of a fully-fledged XML query language, such as the construction of result sets, aggregate functions or support for all XML Schema structures, and focus instead on the semantics of the language.</Paragraph> <Paragraph position="6"> Some modern XML query languages impose a relational view of the data contained in the XML document, aiming at the retrieval of sets of elements with certain properties. While these approaches are adequate for database-like XML documents, they are less appropriate for documents in which XML is used for annotation rather than for the representation of data. Taking a more textual view of an XML document, querying it can be regarded as finding patterns that comprise XML elements and textual content. 
One of the main differences when querying annotated texts is that the query typically captures parts of the document that go beyond the boundaries of a single element, disrupting the XML tree structure, whereas querying a database-like document returns subtrees that remain within the scope of an element. Castagna (2005) distinguishes path expressions, which correspond rather to the database view, and regular expression patterns as complementary &quot;extraction primitives&quot; for XML data. Our approach enhances the concept of regular expression patterns by making them mutually recursive and letting them match across element boundaries.</Paragraph> </Section> </Paper>