<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2706"> <Title>Querying XML documents with multi-dimensional markup</Title> <Section position="3" start_page="43" end_page="49" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> After the publication of the XML 1.0 recommendation, the early proposals for XML query languages focused primarily on the representation of hierarchical dependencies between elements and the expression of properties of a single element. Typically, hierarchical relations are defined along the parent/child and ancestor/descendant axes, as in XQL and XPath. XQL (Robie, 1998) supports positional relations between the elements in a sibling list. Sequences of elements can be queried by the &quot;immediately precedes&quot; and &quot;precedes&quot; operators, which are restricted to siblings. Negation, conjunction and disjunction are defined as filtering functions specifying an element. XPath 1.0 (Clark and DeRose, 1999) is closely related and primarily addresses the structural properties of an XML document by path expressions. As in XQL, sequences are defined on sibling lists. The Working Draft for XPath 2.0 (Berglund et al., September 2005) provides support for more data types than its precursor, especially for sequence types, and defines set operations on them.</Paragraph> <Paragraph position="1"> XML-QL (Deutsch et al., 1999) follows the relational paradigm for XML queries and introduces variable binding to multiple nodes and regular expressions describing element paths.</Paragraph> <Paragraph position="2"> The queries are resolved using an XML graph as the data model, which allows both ordered and unordered node representation. XQuery (Boag et al., 2003) shares with XML-QL the concept of variable bindings and the ability to define recursive functions. 
XQuery features more powerful iteration over elements by means of the FLWR expression borrowed from Quilt (Chamberlin et al., 2001), string operations, &quot;if else&quot; case differentiation and aggregate functions. The demand for stronger support of querying annotated texts led to the integration of full-text search in the language (Requirements, 2003), enabling full-text queries across element boundaries.</Paragraph> <Paragraph position="3"> Hosoya and Pierce propose the integration of XML queries in a programming language (Hosoya and Pierce, 2001) based on regular patterns with Kleene closure and union and with &quot;first-match&quot; semantics. Pattern variables can be declared and bound to the corresponding XML nodes during the matching process. A static type inference system for pattern variables is incorporated in XDuce (Hosoya and Pierce, 2003), a functional language for XML processing. CDuce (Benzaken et al., 2003) extends XDuce by an efficient matching algorithm for regular patterns and first-class functions. The query language CQL, based on the regular patterns of CDuce, uses CDuce as a query processor and allows efficient processing of XQuery expressions (Benzaken et al., 2005).</Paragraph> <Paragraph position="4"> The concept of fuzzy matching has been introduced in query languages for IR (Carmel et al., 2003), relaxing the notion of context of an XML fragment.</Paragraph> <Paragraph position="5"> 3 Querying by pattern matching The general purpose of querying XML documents is to identify and process those fragments that satisfy certain criteria. We reduce the problem of querying XML to pattern matching. The patterns specify the query statement describing the desired properties of XML fragments, while the matching fragments constitute the result of the query. Therefore the pattern language serves as the query language, and its expressiveness is crucial for the capabilities of the queries. 
The scope for the query execution can be a collection of XML documents, a single document or, analogously to XPath, a subtree within a document with the current context node as its root. Since there may be several XML fragments in the scope of the query that match the pattern, multiple matches are treated according to the &quot;all-match&quot; policy, i.e. all matching fragments are included in the result set. The pattern language does not currently support the construction of new XML elements (however, it can be extended by adding corresponding syntactic constructs). The result of the query is therefore a set of sequences of XML nodes from the document. Single sequences represent the XML fragments that match the query pattern. If no XML fragments in the query scope match the pattern, an empty result set is returned.</Paragraph> <Paragraph position="6"> In the following sections the semantics, main components and features of the pattern language are introduced and illustrated by examples. The complete EBNF specification of the language can be found at http://page.mi.fu-berlin.de/~siniakov/patlan.</Paragraph> <Section position="1" start_page="44" end_page="46" type="sub_section"> <SectionTitle> 3.1 Extended sequence semantics </SectionTitle> <Paragraph position="0"> Query languages based on path expressions usually return sets (or sequences) of elements that conform to the original hierarchical structure of the document. In XML documents that are not uniformly structured, though, the hierarchical structure of the queried documents is unknown. The elements we may want to retrieve, or their sequences, can be arbitrarily nested. When retrieving the specified elements, the nesting elements can be omitted, disrupting the original hierarchical structure. 
Thus a sequence of elements no longer has to be restricted to the sibling level and may be extended to a sequence of elements following each other on different levels of the XML tree.</Paragraph> <Paragraph position="1"> V) from a chunk-parsed POS-tagged sentence.</Paragraph> <Paragraph position="2"> XML nodes are labeled with preorder numbered OID|right bound (maximum descendant OID). To illustrate the semantics and features of the language we will use the mentioned text mining scenario. In this particular text mining task some information has to be found in HTML documents with textual data. The documents contain linguistic annotations inserted by a POS tagger and a syntactic chunk parser as XML elements that include the annotated text fragment as a text node. The XML output of the NLP tools is merged with the HTML markup so that various nestings are possible.</Paragraph> <Paragraph position="3"> A common technique to identify the relevant information is to match linguistic patterns describing it against the documents. The fragments of the documents that match are likely to contain relevant information. Hence the problem is to identify the fragments that match our linguistic patterns, that is, to answer the query in which the queried fragments are described by linguistic patterns. Linguistic patterns comprise sequences of text fragments and XML elements added by NLP tools and are specified in our pattern language. When looking for linguistic patterns in an annotated HTML document, it cannot be predicted how the linguistic elements are nested, because nesting depends on the syntactic structure of a sentence, the HTML layout and the way both markups are merged.</Paragraph> <Paragraph position="4"> Basically, the problem of unpredictable nesting occurs in any document with a heterogeneous structure. Let us assume we search for a sequence of POS tags, NE ADV V, in the subtree of an HTML document depicted in fig. 1. 
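The OID|right bound labels named in the caption of fig. 1 can be reproduced by a single preorder traversal. The following Python fragment is only a sketch under stated assumptions: the figure itself is not available here, so TREE is a hypothetical nested-tuple reconstruction of the subtree that is merely consistent with the node numbers quoted in this section, and the DET element and the words "recently" and "a" are placeholders rather than content taken from the figure.

```python
# Hypothetical reconstruction of the subtree in fig. 1 as nested tuples
# (name, children); plain strings stand for text nodes. DET, "recently"
# and "a" are placeholders that are not taken from the paper.
TREE = ("sentence", [
    ("NP", [("b", [("NE", ["Nanosoft"])])]),
    ("ADV", ["recently"]),
    ("VP", [("V", ["released"])]),
    ("NP", [("DET", ["a"]), ("ADJ", ["new"]), ("NN", ["version"])]),
    ("PP", [("PR", ["of"]), ("NP", [("NE", ["NanoOS"])])]),
])


def label(node, labels=None):
    """Return {oid: (name, right_bound)} for the subtree rooted at node."""
    if labels is None:
        labels = {}
    oid = len(labels) + 1                  # next preorder number
    name, children = (node, []) if isinstance(node, str) else node
    labels[oid] = None                     # reserve the OID before descending
    for child in children:
        label(child, labels)
    labels[oid] = (name, len(labels))      # largest OID seen in this subtree
    return labels
```

Under these assumptions, label(TREE)[11] evaluates to ("NP", 17): the node later written as NP11 covers the OIDs 11 to 17, and a leaf such as "released" receives the same number as OID and as right bound.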
Some POS tags are chunked in noun (NP), verb (VP) or prepositional phrases (PP). The named entity &quot;Nanosoft&quot; is emphasized in boldface and therefore nested in the HTML element &lt;b&gt;. Due to the syntactic structure and the HTML markup, the elements NE, ADV and V are on different nesting levels and are not children of the same element. According to the extended sequence semantics we can ignore the nesting elements we are not interested in (NP2 and b3 when matching NE, VP8 when matching V), so that the sequence (NE4, ADV6, V9) matches the sequence pattern NE ADV V, in short form NE ADV V ≅ (NE4, ADV6, V9).</Paragraph> <Paragraph position="6"> With the previous example we introduced the matching relation ≅ as a binary relation ≅ ⊆ P × F, where P is the set of patterns and F a set of XML fragments. An XML fragment f is a sequence of XML nodes n1 ... nn that belong to the subtree of the context node (i.e. the node whose subtree is queried, e.g.</Paragraph> <Paragraph position="7"> document root). Each XML node in the subtree is labeled by the pair OID|right bound.</Paragraph> <Paragraph position="8"> The OID is obtained by assigning natural numbers to the nodes during the preorder traversal.</Paragraph> <Paragraph position="9"> The right bound is the maximum OID of a descendant of the node, i.e. the OID of the rightmost leaf in the rightmost subtree. To match a sequence pattern an XML fragment has to fulfil four important requirements.</Paragraph> <Paragraph position="10"> 1. Consecutiveness: All elements of the sequence pattern have to match consecutive parts of the XML fragment. 2. Order maintenance: Its elements must be in the &quot;tree order&quot;, i.e., the OIDs of the nodes according to the preorder numbering scheme must be in ascending order.</Paragraph> <Paragraph position="11"> 3. Absence of overlaps: No node in the sequence can be the predecessor of any other node in the sequence on the way to the root. E.g. 
NP PP NP ≇ (NP11, PP18, NP21) because PP18 is a predecessor of NP21 and therefore subsumes it in its subtree. The semantics of the sequence implies that a sequence element cannot be subsumed by the previous one but has to follow it in another subtree. To determine whether a node m is a predecessor of the node n, the OIDs of the nodes are compared. The predecessor must have a smaller OID according to the preorder numbering scheme; however, any node in the subtrees to the left of n has a smaller OID too.</Paragraph> <Paragraph position="12"> Therefore the right bounds of the nodes are compared as well, since the right bound of a predecessor will be greater than or equal to the right bound of n, while the right bound of any element in a left subtree will be smaller: m is a predecessor of n ⟺ OID(m) &lt; OID(n) ∧ rightBound(m) ≥ rightBound(n).</Paragraph> <Paragraph position="14"> 4. Completeness: The XML fragment must not contain any gaps, i.e. there must be no node that is neither in the XML fragment nor a predecessor of one of its nodes, but whose OID lies between the OIDs of the fragment nodes. Since such a node is not a predecessor, it must be an element of the sequence; otherwise it is omitted and the sequence is not complete. Hence, the pattern V NP NP ≇ (V9, NP11, NP21) because the node PR19 lying between NP11 and NP21 is neither a predecessor of any of the fragment nodes nor an element of the fragment. If the nodes lying between NP11 and NP21 cannot be exactly specified, we can use the wildcard pattern (see sec. 3.3) to enable matching: V NP * NP ≅ (V9, NP11, PR19, NP21).</Paragraph> <Paragraph position="16"> Using these requirements we can formally specify the semantics of the sequence: Let s = s1 ... sk be a sequence pattern and</Paragraph> <Paragraph position="18"> The fourth requirement stresses the important aspect of the &quot;exhaustive&quot; sequence: we are interested in a certain sequence of known elements that can be arbitrarily nested and enclosed by some elements that are irrelevant for our sequence (e.g. 
HTML layout elements when searching for a sequence of linguistic elements). We call such a sequence an exhaustive non-sibling sequence (ENSS). It is exhaustive because all predecessors omitted during the matching are covered at some level by the matching descendants, so that there is no path to a leaf of the predecessor subtree that leads through an unmatched node. If such a path existed, the fourth requirement would not be met. If the sequence does not begin at the leftmost branch or does not end at the rightmost branch of an omitted predecessor, the subtree of the respective predecessor is not fully covered. In ADJ NN PR ≅ (ADJ14, NN16, PR19) the omitted predecessors NP11 and PP18 are not completely a part of the sequence because they have descendants outside the sequence borders. Nevertheless the sequence is exhaustive since there is no path to a leaf through an unmatched node within its borders.</Paragraph> <Paragraph position="19"> Another important aspect of the ENSS is that it can match XML fragments across element borders. XPath imposes a query context by specifying the path expression that usually addresses a certain element; XQuery restricts it indirectly by iterating over certain nodes and binding variables to them. When matching an ENSS there is no additional restriction of the query scope, that is, the sequence can begin and end at any node provided that the ENSS requirements are met.</Paragraph> <Paragraph position="20"> The dashed line in fig. 1 marks the region covered by the sample sequence.</Paragraph> <Paragraph position="21"> According to the specification of the sequence pattern in the pattern language (cf.</Paragraph> <Paragraph position="22"> appendix ??): Pattern ::= Pattern ' '* Pattern, any pattern can be an element of the sequence. 
Therefore the sequence can also contain textual elements, which is especially important when processing annotated texts.</Paragraph> <Paragraph position="23"> Textual nodes represent leaves in an XML tree and are treated like other XML nodes, so that arbitrary combinations of XML elements and text are possible: &quot;released&quot; NP &quot;of&quot; NE ≅ (&quot;released&quot;10, NP11, &quot;of&quot;20, NE22). The exhaustive sequence allows a much greater abstraction from the DTD of a document than the commonly used sequence of siblings. The expressiveness of the language benefits significantly from the combination of backtracking patterns (cf. sec. 3.3) with the exhaustive sequence.</Paragraph> </Section> <Section position="2" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 3.2 Specification of XML nodes </SectionTitle> <Paragraph position="0"> Patterns matching single XML nodes are the primitives from which the more complex patterns are composed. The pattern language supports matching for document, element, attribute, text and CDATA nodes, while some DOM node types such as entities and processing instructions are not supported. Some basic patterns matching element and text nodes have already been used as sequence elements in the previous section. Besides the simple addressing of an element by its name it is possible to specify the structure of its subtree: Pattern ::= '\'XML-Tag('['Pattern']')? A pattern specifying an element node will match if the element has the name corresponding to the XML-Tag and the pattern in the square brackets matches the XML fragment containing the sequence of its children. E.g.</Paragraph> <Paragraph position="1"> \PP[ PR NE] ≅ (PP18) because the name of the element is identical and PR NE ≅ (PR19, NE22). As this example shows, the extended sequence semantics also applies when the sequence is used as the inner pattern of another pattern. 
Therefore the specification of elements can benefit from the ENSS because we again do not have to know the exact structure of their subtrees, e.g. their children, but can specify the nodes we expect to occur in a certain order.</Paragraph> <Paragraph position="2"> Attribute nodes can be accessed by an element pattern specifying the attribute values as a constraint: \V {@normal=&quot;release&quot;} ≅ (V9), assuming that the element V9 has the attribute &quot;normal&quot; that stores the principal form of its textual content. Besides equality tests, numeric comparisons and boolean functions on string attribute values can be used as constraints.</Paragraph> <Paragraph position="3"> Patterns specifying textual nodes comprise quoted strings:</Paragraph> </Section> <Section position="3" start_page="46" end_page="47" type="sub_section"> <SectionTitle> Pattern ::= QuotedString </SectionTitle> <Paragraph position="0"> and match a textual node of an XML element if it has the same textual content as the quoted string. Textual patterns can be used as elements of any other patterns, as already demonstrated in the previous section. An element may be, for instance, described by a complex sequence of text nodes combined with other patterns: \sentence[NE * \V{@normal=release} \NP[* &quot;new&quot; &quot;version&quot;] &quot;of&quot; NE *] ≅ (sentence1). The pattern above can already be used as a linguistic pattern identifying the release of a new product version.</Paragraph> </Section> <Section position="4" start_page="47" end_page="48" type="sub_section"> <SectionTitle> 3.3 Backtracking patterns and variables </SectionTitle> <Paragraph position="0"> In contrast to database-like XML documents featuring very rigid and repetitive structures, annotated texts are distinguished by great structural variety. To handle this variety one needs patterns that can cover several different cases &quot;at once&quot;. 
So-called backtracking patterns have this property and therefore constitute a substantial part of the pattern language. Their name comes from the fact that backtracking is necessary during the matching process to find a match.</Paragraph> <Paragraph position="1"> The pattern language features complex and primitive patterns. Complex patterns consist of at least one inner element that is a pattern itself. Primitive patterns are textual patterns or XML attribute and element specifications if the specification of the inner structure of the element is omitted, e.g. &quot;released&quot;, NP. If at least one of the inner patterns does not match, the matching of the complex pattern fails. Backtracking patterns, except for the wildcard pattern, are complex patterns.</Paragraph> <Paragraph position="2"> Let us assume we look for a sequence &quot;released&quot; NE and do not care what is between the two sequence elements. In the subtree depicted in fig. 1 no XML fragment will match because there are several nodes between &quot;released&quot;10 and NE22 and the completeness requirement is not met. If we include the wildcard pattern in the sequence, &quot;released&quot; * NE ≅ (&quot;released&quot;10, NP11, PR19, NE22), the wildcard pattern matches the nodes lying between V9 and NE22. Thus, whenever we do not know what nodes can occur in a sequence or are not interested in the nodes in some parts of the sequence, we can use the wildcard pattern to specify the sequence without losing its completeness. The wildcard pattern matches parts of the sequence that are in turn sequences themselves. Therefore it matches only those XML fragments that fulfil the ENSS requirements II-IV. 
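The ENSS requirements II-IV and the wildcard behaviour can be made concrete with a few comparisons on the OID|right bound labels. The Python fragment below is a minimal sketch, not the implementation of the pattern language: NODES is a hypothetical labeling of the subtree in fig. 1 that agrees with the node numbers quoted in the text (DET12 and the words "recently" and "a" are placeholders), and the function names fulfils_enss and wildcard_cover are introduced here for illustration only. Requirement I concerns the pattern rather than the fragment and is not checked.

```python
# Hypothetical OID to (name, right bound) labels for the subtree in fig. 1.
NODES = {
    1: ("sentence", 23), 2: ("NP", 5), 3: ("b", 5), 4: ("NE", 5),
    5: ("Nanosoft", 5), 6: ("ADV", 7), 7: ("recently", 7), 8: ("VP", 10),
    9: ("V", 10), 10: ("released", 10), 11: ("NP", 17), 12: ("DET", 13),
    13: ("a", 13), 14: ("ADJ", 15), 15: ("new", 15), 16: ("NN", 17),
    17: ("version", 17), 18: ("PP", 23), 19: ("PR", 20), 20: ("of", 20),
    21: ("NP", 23), 22: ("NE", 23), 23: ("NanoOS", 23),
}


def right_bound(oid):
    return NODES[oid][1]


def is_predecessor(m, n):
    """True if node m is a predecessor (proper ancestor) of node n."""
    return n > m and right_bound(m) >= right_bound(n)


def fulfils_enss(fragment):
    """Check requirements II-IV for a fragment given as a list of OIDs."""
    for prev, nxt in zip(fragment, fragment[1:]):
        # II and III: the next node must start after the whole subtree of prev.
        if not nxt > right_bound(prev):
            return False
        # IV: every node skipped in between must be a predecessor of nxt.
        for gap in range(right_bound(prev) + 1, nxt):
            if not is_predecessor(gap, nxt):
                return False
    return True


def wildcard_cover(left, right):
    """Highest-level nodes that fill the gap between two fragment nodes."""
    cover, oid = [], right_bound(left) + 1
    while right > oid:
        if is_predecessor(oid, right):
            oid += 1                     # omitted predecessor: step inside it
        else:
            cover.append(oid)            # take this node with its whole subtree
            oid = right_bound(oid) + 1
    return cover
```

With these assumed labels, fulfils_enss([4, 6, 9]) holds, whereas [11, 18, 21] violates the overlap requirement and [9, 11, 21] violates completeness because of PR19. wildcard_cover(10, 22) returns [11, 19], which corresponds to the fragment (&quot;released&quot;10, NP11, PR19, NE22) given above.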
Since there are often multiple possibilities to match a sequence on different levels, the wildcard matches nodes that are at the highest possible level, such as NP11 in the previous example.</Paragraph> <Paragraph position="3"> If one does not know whether an XML fragment occurs, but wants to account for both cases, the option pattern should be used: Pattern ::= '('Pattern')?' Pattern ::= '('Pattern')*' The Kleene closure differs from the option by the unbounded number of repetitions. It matches a sequence of any number of repetitions of XML fragments that match the inner pattern of the Kleene closure pattern. Since the Kleene closure matches sequences, the ENSS requirements have to be met by the matching XML fragments.</Paragraph> <Paragraph position="4"> Let O = (p)? be an option, K = (p)* a Kleene closure pattern, f ∈ F an XML fragment:</Paragraph> <Paragraph position="6"> where f fulfils ENSS requirements I-IV.</Paragraph> <Paragraph position="7"> The option pattern matches either an empty XML fragment or its inner pattern.</Paragraph> <Paragraph position="8"> An alternative occurrence of two XML fragments is covered by the union pattern: Pattern ::= '('Pattern('|'Pattern)+')' A different order of nodes in the sequence can be captured by the permutation pattern: Pattern ::= '('Pattern Pattern+')%' Let U = (p1|p2) be a union pattern,</Paragraph> <Paragraph position="10"> Permutation cannot be expressed by regular constructs and is therefore not a regular expression itself.</Paragraph> <Paragraph position="11"> The backtracking patterns can be arbitrarily combined to match complex XML fragments. E.g. the pattern ((PP | PR)? NP)% matches three XML fragments: (NP2), (NP11, PP18) and (PR19, NP21). 
Using the backtracking patterns recursively greatly increases the expressivity of the patterns, allowing very complex and variable structures to be specified without significant syntactic effort.</Paragraph> <Paragraph position="12"> Variables can be assigned to any pattern, Pattern ::= Pattern '=:' String, accomplishing two functions. Whenever a variable is referenced within a pattern by the reference pattern Pattern ::= '$'String'$' it evaluates to the pattern it was assigned to. The pattern (NP)*=:noun_phrase * $noun_phrase$ ≅ (NP2, ADV6, VP8, NP11) so that the referenced pattern matches NP11. A pattern referencing the variable v matches XML fragments that match the pattern that has been assigned to v. To make the matching results more persistent and enable further processing, variables can be bound to the XML fragment that matched the pattern the variable is assigned to. After matching the pattern \sentence[NE=:company * \V{@normal=release} \NP[* &quot;new&quot; &quot;version&quot;] &quot;of&quot; NE=:product *] ≅ (sentence1) the variable company refers to NE4 (Nanosoft) and product is bound to NE22 (NanoOS).</Paragraph> <Paragraph position="13"> The relevant parts of the XML fragment can be accessed by variables after a match has been found. 
Assigning a variable to the wildcard pattern can be used to extract a subsequence between two known nodes: &quot;released&quot; * =:direct_object &quot;of&quot; ≅ (&quot;released&quot;10, NP11, &quot;of&quot;20) with the variable direct_object bound to NP11.</Paragraph> <Paragraph position="14"> Let A = p =: v be an assignment pattern: A ≅ f ⟺ p ≅ f. Matching backtracking patterns can involve multiple matching variants of the same XML fragment, which usually leads to different variable bindings for each matching variant.</Paragraph> <Paragraph position="15"> As opposed to the multiple matches discussed above, where different fragments match the same pattern, the first-match policy is applied when the pattern ambiguously matches an XML fragment. For instance, two different matching variants are possible for the pattern so that noun_phrase is bound to {} and noun_prep to (NP11, PR19). In such cases the first match found is returned as the final result. The order of processing of single patterns is determined by a convention.</Paragraph> </Section> <Section position="5" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 3.4 Negation </SectionTitle> <Paragraph position="0"> When querying an XML document it is often useful not only to specify what is expected but also to specify what should not occur. This is an efficient way to exclude some unwanted XML fragments from the query result because sometimes it is easier to characterize an XML fragment by unwanted rather than desired properties. Regular languages (according to Chomsky's classification) are not capable of representing that something should not appear, stating only what may or has to appear. In the pattern language the absence of some XML fragment can be specified by negation.</Paragraph> <Paragraph position="1"> As opposed to most XML query languages, negation is a pattern and not a unary boolean operator. 
Therefore it has no boolean value, but matches the empty XML fragment.</Paragraph> <Paragraph position="2"> Since the negation pattern specifies what should not occur, it does not &quot;consume&quot; any XML nodes during the matching process, so that we call it &quot;non-substantial&quot; negation. The negation pattern !(p) matches the empty XML fragment if its inner pattern p does not occur in the current context node. To underline the difference to logical negation, consider the double negation. The double negation !(!(p)) is not equivalent to p, but matches an empty XML element if !(p) matches the current context node, which is only true if the current context node is empty. Since the negation pattern only specifies what should not occur, the standalone usage of negation is not reasonable. It should be used as an inner pattern of other complex patterns. By specifying a sequence VP *=:wildcard_1 !(PR) *=:wildcard_2 NP we want to identify sequences starting with VP and ending with NP where PR is not within the sequence. When trying to find a match for the sequence starting in VP8 and ending in NP21, there are multiple matching variants for the wildcard patterns. Some of them enable the matching of the negation pattern by binding PR to one of the wildcards, e.g. wildcard_1 is bound to (NP11, PR19), !(PR) ≅ {}, wildcard_2 is bound to {}. However, there is a matching variant in which the negated pattern is matched with PR19 (wildcard_1 is bound to NP11, wildcard_2 is bound to {}). We would certainly not want the sequence (VP8, NP11, PR19, NP21) to match our pattern because the occurrence of PR in the sequence should be avoided. Therefore we define the semantics of the negation so that there is no matching variant that enables the occurrence of the negated pattern: Let P1 !(p) P2 be a complex pattern comprising negation as inner pattern. P1 and P2 are the left and right syntactic parts of the pattern and may not be valid patterns themselves (e.g. because of unmatched parentheses). 
The pattern obtained from the concatenation of both parts, P1 P2, is a valid pattern because it is equivalent to replacing the negation by an empty pattern.</Paragraph> <Paragraph position="4"> Requiring P1 p P2 ≇ f guarantees that no matching variant exists in which the negated pattern p occurs. Since !(p) matches an empty fragment, the pattern P1 P2 has to match the complete f. It is noteworthy that the negation is the only pattern that influences the semantics of a complex pattern as its inner pattern. Independent of its complexity, any pattern can be negated, allowing a very fine-grained specification of undesirable XML fragments.</Paragraph> </Section> </Section> </Paper>